Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, and servus from Vienna. Welcome to Conf42 Cloud Native 2024. I'm Dmytro Hlotenko, and
I am pleased to welcome you to my session, Journey beyond AWS: quest
for excellence. In this session I will be happy to give you some
specific details on an AWS service and tell you a story: how I
modernized an already existing service, how
I resolved different issues, and how I made the overall experience of
running our application on AWS much better for
our solutions team, for our developers, and for our product
team. And please get comfortable, because this
will be a session with specific details and a real-life story.
So I'm really excited to have you here, and since we are limited on
time, let's jump into the session. But first of
all, just a few short words about me. I'm Dmytro
Hlotenko. I am a cloud engineer at APA-IT, and my
colleagues call me Mr. Amazon. I have been an AWS Community Builder since
2024, and I'm a co-leader of the AWS User Group.
I hold a few degrees related to telecommunications,
and I have been in IT for a long time already; I have seen different things
in different places, but I really enjoy the cloud and everything
related to architecture and software, and I'm really happy
to deliver not only talks at conferences but services to
our customers, and to interact with different exciting people. And yeah,
I'm also a motorsport fan and photography hobbyist, and in
terms of AWS I'm still targeting the golden
jacket, but that is a very long way to go.
And if you have any questions after this session, you will find my
LinkedIn QR code at the end. You are welcome to send
me an invitation; I'm always happy to discuss any points and to help you.
And yeah, so I would like
to announce what we will have today. I'm
going to give you some information about what we are actually doing with AWS,
because this case was one of the first cases that I refactored
since I started at APA-IT. I will give you brief information on the application
and, in general, some details on how the services
cooperate, how it was built, how it was reworked,
and some small details that will hopefully give you some inspiration
for your workload, or the idea for a potential case of your own
to be resolved. So first of all, just short information about
my company. At APA-IT we maintain most of the
biggest Austrian media companies, such as ORF,
which is the national broadcaster, and lots of publications
like Der Standard. And we have not only media customers:
we also develop special applications for journalists,
applications for communication and data processing, for our press
agency together with different other smaller European agencies,
and we process a huge amount of information in order to deliver different
media to the public, to deliver true news and
some other exciting things. We have two data centers
in Vienna, and I'm the go-to person for cloud topics and
for AWS. So I have already put my hand on different cloud
integrations in the different applications that we run.
So we are running lots of mobile applications on AWS. We do analytics and
data processing, and of course, since we have critical appliances,
we do disaster recovery. But this talk is not about that;
it is about media publishing and processing, because AWS, with its
services, gives you so many opportunities that you can utilize
to deliver your business targets and build completely different business
logic for different cases. And actually,
the topic of our discussion today is MediaContact
Plus, one of the applications that we run. The idea behind it
was to create a centralized solution for delivery.
For example, you have some exciting announcement from your company (you
launched a satellite or a new invention, whatever), and you can
come to MediaContact Plus and select the best journalists,
who will take the best from your information and share it with
the right audience.
So you just need to tell what and where, and the application is
built in such a manner that you will just
get it delivered: your information will be processed and
spread. And since it's powered by AWS,
it was actually the first project that APA-IT
delivered in the cloud. And AWS SES is actually
an amazing solution: we have a few projects that rely on it, and
the complete service is built on the
backbone of AWS SES. So this is a fundamental thing,
and a good example of how the Simple Email Service
can power a not-so-simple application. And we have
different integrations with other APA systems, just to
make sure that messages are delivered properly
and that their resonance is measured. So we process lots of data which comes
from AWS SES and from the AWS cloud,
and all this data is processed by automation that I'm going to present a
bit later in this session. So there is
a unique integration of an AWS solution in
a real product, but we also had to
properly integrate it with the real people who work in the product
team at APA-IT. Yeah, this has been running since 2021, and
the reason why this session is here today is that
I was trying to maximize the outcome of our
usage of AWS, because initially it was built not
by me; I just took over the project and made
it better. Yeah, I came, I rebuilt, and now it runs.
Yeah. Just a few words on AWS SES,
and what I absolutely love about it: it does not create any
headache. For example, we have lots of customers,
and I don't want to lie, but lots of big companies
are already sending through it, and the setup
is pretty easy. You just have to follow some small things, like the DMARC setup:
you have to make the proper DKIM alignment and SPF records,
you have to make sure that the domain is validated
properly, and you just have to follow some small guidelines
from AWS. For example, if it's marketing, you should include an unsubscribe
link in the body of the message. AWS controls this,
and you should take care of it, because we
have to fight spam together. And the setup is extremely
easy, and it does not limit you in how
you would like to use it. So you can use it like a normal SMTP
client, or you can use it as an API integration
for your application, and of course it can be part of a completely
serverless project. I use SES for my own purposes as well,
not only commercial ones. And what do I hate about
it? Because it's love and hate; that's normal in any situation with
interesting things. The complaint rate is a bit specific,
because, for example, you might get complaints because
the IP address from which SES operates shows up
on some strange blocklists which
are not even directly accessible,
and yeah, that may hurt your delivery rate. And this is
important, because this is a messaging thing and you want the
things delivered. And yeah, out of the box
you don't have any logging or any tracking; you have to build it
yourself. But I will show you an example that maybe will also
be useful for you. And yeah, as the name says, it's a simple service:
it just gives you sending functionality. And I'm
pretty excited about it, because it's a good foundation, and
I don't think there are many better services that can do the same without
comparable effort. You just need to
take some things into account: proper domain configuration
on AWS is essential, and that's
why we take some time to track the delivery rate, and we take
some additional notices because of it.
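The delivery-rate tracking mentioned here can be sketched in a few lines. This is a minimal, illustrative helper, assuming the counters come from SES event notifications or the `GetSendStatistics` API; the thresholds reflect AWS's published reputation guidance (roughly 5% bounces, 0.1% complaints), so check the current SES documentation before relying on them.

```python
# Hedged sketch of delivery-rate tracking. The thresholds below are
# assumptions based on AWS's published reputation guidance; verify them
# against the current SES documentation.

def delivery_health(sent: int, bounces: int, complaints: int) -> dict:
    """Summarize sending health from raw SES counters."""
    if sent == 0:
        return {"bounce_rate": 0.0, "complaint_rate": 0.0, "ok": True}
    bounce_rate = bounces / sent
    complaint_rate = complaints / sent
    return {
        "bounce_rate": bounce_rate,
        "complaint_rate": complaint_rate,
        # Flag well before AWS's review thresholds kick in.
        "ok": bounce_rate < 0.05 and complaint_rate < 0.001,
    }

# The counters can come from SES event notifications, or roughly (untested):
#   import boto3
#   points = boto3.client("ses").get_send_statistics()["SendDataPoints"]
```

A dashboard or alarm can then watch the `ok` flag rather than raw counters.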
But in general, SES is an amazing service. So if you just have to
send email, this is the way to go. And going
a bit back to the global vision of the application:
so let's just imagine
the case. You come to the company, and you have a perfectly working application. You are a
cloud architect, engineer, or in an Ops position, doesn't matter.
And you see that, well,
since it was the first cloud experience for your company, probably it
was built nicely; your colleagues are great people, and of course
we could leave everything as it is: "don't touch it if it works" is like
a mantra. But this is not about us. We are interested in
gaining the benefits of the cloud for our company, for ourselves,
in developing further and becoming cooler specialists.
And of course we could also just drink coffee and relax: if it works,
why should we stress? But yeah, this is not the way, and this is why
we start the analysis. This is our work, and
we don't have to give up on the cloud, because the cloud
actually is amazing. And what I would like to underline:
first of all, the best way to understand what you
have is just to see how it runs: what are you
running, and who is running it. And just talk
to your colleagues, because they have already done it; of course they analyzed the
situation and took the decisions. First of
all, you have lots of different information, and data is
extremely valuable in this case. And thankfully,
AWS gives you lots of opportunities for why
and where you can get this data from different perspectives, and
from the AWS services alone you will already have half of the information
that you need. So just let's have a quick look.
You have to talk to the people; this is extremely important.
You cannot just deliver the thing by yourself and make a statement that
this is it. Of course, if you have some authority, you can
do it, but it is still easier: it's already running,
and you must see why and how it runs. So CloudWatch
is an amazing thing. If you just take some time to
analyze the logging, analyze the existing metrics,
and maybe onboard some additional metrics, you will have lots of valuable data,
because then you can process the behavior of the customers
who visit the application, and you can see how well you are
scaled and how well you are utilizing the actual resources
that you have created. Maybe you have an underutilized environment and
you can just reduce the size of your instances, or
for example of RDS, whatever you have, and be much better off.
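As a sketch of that utilization analysis, assuming hourly `CPUUtilization` averages pulled from CloudWatch; the 30% threshold and the instance ID are illustrative assumptions, not AWS guidance.

```python
# Hedged sketch of a right-sizing check. The 30% threshold and the
# instance ID below are illustrative assumptions, not AWS guidance.

def suggest_downsize(cpu_averages: list[float], threshold: float = 30.0) -> bool:
    """True if the instance looks underutilized: even the busiest
    sampled period stayed under the threshold."""
    return bool(cpu_averages) and max(cpu_averages) < threshold

# The datapoints would come from CloudWatch, roughly like this (untested):
#   import boto3, datetime
#   cw = boto3.client("cloudwatch")
#   resp = cw.get_metric_statistics(
#       Namespace="AWS/EC2", MetricName="CPUUtilization",
#       Dimensions=[{"Name": "InstanceId", "Value": "i-0123example"}],  # placeholder
#       StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=14),
#       EndTime=datetime.datetime.utcnow(), Period=3600, Statistics=["Average"],
#   )
#   cpu_averages = [p["Average"] for p in resp["Datapoints"]]
```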
Logs Insights and dashboards are a very amazing way to gather
all the data together, and with Insights you can run
queries, for example for some specific message
that comes out of your application. And of course you
must be connected to CloudWatch, for example with the Docker logging driver
or something else, whatever you use. And one of the things
that I would like to mention, and will mention separately, is the AWS Fault Injection
Service, because even though an application can be running perfectly,
you don't want to
be surprised. Of course it's nice to have a surprise, for example some BMW
parked in front of your windows, but not when you are woken up at 2:00
a.m. by your phone, by your operations team, because something is broken and you
don't have recovery plans, you don't have anything. So the Fault
Injection Service is amazing, because you can simulate breakage of AWS
things, you can break your application, you can break the
networking, and just understand what is missing: where you have
missed observability, where you have maybe missed some functions
in your application. So it's an amazing tool
to remove the blind spots for you.
And yeah, I still have to admit that
Apache JMeter is a good old tool,
but in this case I also used the Robot Framework and Selenium
to simulate our users and to understand how much
and what we can run, and in which way.
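The response times collected from such simulated users are usually summarized as percentile latencies. A minimal nearest-rank helper, as an illustration (not the talk's actual tooling):

```python
import math

# Illustrative helper: summarize response times from simulated users as
# nearest-rank percentiles. Not from the project's actual load-test code.

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the collected response times (ms)."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank definition
    return ordered[max(rank, 1) - 1]
```

For example, `percentile(latencies, 95)` gives the p95 a load-test report would show.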
And of course, since part of this talk is also
how to comply with the existing processes:
I really like CloudWatch,
because it basically does lots of things, but Checkmk
out of the box provides you with the most essential
things that you would like to monitor in your AWS account.
It does auto-detection of the most important things, and
it does a pretty amazing job; you can just have a look at how it
works and use it as a foundation for your observability pipeline.
Yeah, and from the people: in this case, for example, I worked deeply
with our developers and with our solutions architect,
because he maintained the setup before, and it
was important to find the reasons: why, for example, it
was not made serverless, or why this
certain database was picked, or whatever.
And it's also very important to know your customer:
what he will do, in which amounts, and how he will do it. This is
why you need to communicate with the product team. And the second reason why
you have to communicate with the product team: you can do a lot on AWS,
but as you know, every press of a button on AWS costs money.
And your target as a cloud architect is not only to deliver
the efficient solution, but also to make it good
from the cost side, because your business
must be profitable, because otherwise you will have no job. And
you must take lots of things into account in the cloud from the technical perspective;
and from the other perspective, you should see what you
already have. And I have to admit that one of the best cases
to understand what you have, especially if your team initially didn't
use CloudFormation or Terraform or OpenTofu, doesn't matter:
you have to understand what you are working with, how it runs,
where it runs, and how much data it
consumes. And one of the best ways to understand what you have
is just to use infrastructure as code. If you
already have the template, you are lucky: you can just go through it
and see which resources are connected where. But if you don't
have it, and you come to a blank project with minimal documentation
that has just some charts of the setup or whatever,
then writing it down as code is an amazing way to get an understanding
of what you run. But also, when you are
working with AWS services, you don't have to reinvent the
wheel, because most of the things are already created
by AWS. For example, you don't need to run EC2
with a database, because you can go with RDS and you will have
a managed database. Or you don't have to run your own
file share, because you can use AWS EFS.
Or you don't have to set up extra monitoring: you can try to
utilize CloudWatch natively in
your project, which is amazing, because you have lots of
metrics you can process, et cetera.
So for example, as I
already said with the example of the database:
if AWS says that you can use RDS for your things,
then you should prefer to take RDS. And actually, one hint
for the certifications: you should know the AWS
way of doing things, because it basically pops up in
most of the certification questions. And if you are
getting AWS certified, this is essential: not only to
use the services, but to know what AWS will encourage you to do. And one
good example I also remember is to
not host Kafka
yourself: you can go to Amazon MSK, et cetera.
And automate the routine, because time is gold and you
get paid for your work. And, for example, you don't want to come to some
certain account and press some button that nobody
knows about. If it can be automated, it must be automated.
But don't automate everything; understand the value of
the things, because when we are talking about any implementation, you should
understand that effort also costs money,
and every service also costs money. You should find the
balance point: how much effort you will invest, and
what the output from it will be. And actually,
finding the blind spots is also a good point, because the application must
be resilient. You must have almost no downtime,
as much as you can; especially during business hours, you have to avoid it.
And, for example, you should also look from the security perspective:
you must not have exposed credentials or
unpatched things, and some
other things. And for example,
also going forward: if you can save on some things by using spot
instances, by using a different type
of instance, or some smaller size of RDS,
because it's just underused, then do it; you should use the
resources you are paying for. So maybe you can optimize
it this way. And of course security is important, and thankfully
AWS gives you lots of good practices for how
and what can be done. You can use additional things, and basically
the baselines for architecting and service implementation.
You can get AWS as secure as you can, but you still
have to be careful and include security at the stage
of software development, building, imaging, and deploying
to AWS. So going forward, let's talk
about the evolution: what we actually did, and how I did it.
But first of all, I would like to remind you that this was the
first cloud deployment in the company,
and we already have lots of amazing developers
who work with lots of things, but they are used to
working in some specific ways. And what is important
in every relationship, doesn't matter whether romantic or business, et cetera:
you don't want to scare anyone, and everything must be done softly.
And our target as cloud engineers is to
precisely, but very carefully, introduce cloud things into the
company, because otherwise things may play against you, and
you will end up dealing with some VMware host and calling it
cloud. But yeah,
so the target here was not to cause any additional
headache for the developers: just give them what they have,
do as much as we can on the cloud side, completely transparently
for them, and if there is an improvement that will minimize
their work, just notify them. And the goal is to
remove as many blind spots as we can and improve efficiency,
because efficiency is the most important thing for our application
once we already know what we have, how it runs,
what it needs, and who uses
it. It's good to ask yourself such questions.
So of course, as I mentioned, we don't want to break the processes.
Can we run more cost-efficiently? This is the most essential question, because
there are lots of opportunities to optimize your costs: using
FinOps tools, just having a good engineer, or,
for example, using AWS Config. And there was a case
where AWS Config
was able to find lots of unused EBS snapshots that
were just burning money, and this is why you also have to use it,
and to verify that our application is running on
the proper instance with the proper database size: we don't want to underutilize
or overutilize the things, and right-sizing also belongs here.
Setup reliability should be taken into account for every
production deployment, because your service must
be as steady as it can be, and this is why I
will go forward and show you some things
that directly influence reliability. Monitoring coverage
is also essential, because then you know how to behave
in some situations, or just get notified,
to minimize the downtime. This is important. Security, as I
mentioned, is also on this list. And a target
for me was to get rid of operational overhead, and this is why I
brought some automation to this project. And let's
have a look. So this is the initial setup. Basically it might remind
you of the most simple, I would say vanilla,
deployment on AWS. We have just something running on EC2.
We use CodeCommit and CodeBuild, we build the things, we have a few staging
accounts, so everything is pretty simple; it just
works. But it's nothing fancy. And of
course we can see some things we can improve: we can use private
subnets, we can use some other implementations,
we can utilize the WAF, we can use two RDS
instances, the second one as standby, to improve resiliency, because
then you will have a much shorter failover time. And you
can also think about backups, and maybe some automation,
because right now, at this stage, there is no delivery of the things
coming from the main account. So as I said,
basically we were just running containers on
the EC2 instances, because that is like a normal Linux
virtual machine and you can have the normal Docker that everybody
is used to. But they are on-demand instances, and we have to ask ourselves:
is that actually the right thing for us? A single RDS instance
is one of the most dangerous things that you
can have, because you will have, for example, 15 to 20
minutes of failover. And especially with the burstable credit system,
for your storage or for the instance type:
if you don't use a production-grade
type, you can get stuck on credits, and if you don't monitor them, you
will be very unpleasantly surprised. No automation:
no comment on this. Provisioning time
was also essential, because it affects the recovery time, and
the way to improve it came out of analyzing
the application behavior; I went through the startup
process. Monitoring
was extremely basic: we were just checking the external
endpoint of the load balancer, not even the application itself. And some
things were missing: for example, we didn't use
secret management, the WAF was missing, and
some other small things. And to redeploy the thing you actually had to come to the
console and do the click-ops. Well, of
course it can be easily resolved. But what
I would like to underline again: the target was not to break the existing
setup, which was working and with which the developers were familiar.
I didn't want to introduce any extra things or new services like
ECS or whatever; the target for me was to keep the same baseline.
Of course there are lots of things that
you can do, so please have a look at the current setup.
As you can see, lots of things appeared. Here
we have a different designation of the running
application: core subnets were moved, we have a different database,
we got EFS for caching of the shared data, and we got
security checking by Inspector, by
Config, and by GuardDuty. For this
application we got additional integrations which use
the data that comes from the application and from the Amazon services;
as you can see on the right side,
they can contact our product team or they can contact me,
and the build process is automatic. And most
importantly, the whole setup fits into
the processes of the company as they have already been working
for a long period of time, I would say.
And yeah, so what did we get after the transformation?
I even improved the performance, because
by reducing the costs and changing the approaches
we could take new things that would probably
fit our application better; that's why you have to analyze the
performance. And we deeply increased the monitoring
coverage: we control the metrics from CloudFront, from
RDS, and from the application itself, the things
that come up in the application logs, the SES
failures, the connections. And actually
there is a lot of valuable data that you can just go and collect,
like mushrooms after the rain.
And now we have a fully automated
deployment which complies with the ITIL process and
with the whole change process: all
you have to do is approve it, and the
automation will get the deployment done for you. And if something goes wrong,
it rolls back. Security and resiliency
were also improved, because we don't have any exposed things;
things are better hidden. And we
have a few additional security checks and conformance checks,
and this is important. Yeah, and as I mentioned already, SES events
and some other small improvements.
And just going back to the scheme, I just wanted to mention that this is
a multi-account architecture. So we have a main account
that runs production, we have the staging account,
a test account for our developers, and we have a separate
account which acts as a repository.
And there is a lot of interconnection between those
accounts, but it lets you keep things tidier, because
then you can have the small things
more separated, and you will have
fewer points of breakage. And what is also important
to mention: it's not only about controlling
the stuff in AWS; as you can see, we also control what is
incoming from the systems on which the application depends. So, for
the next part of this presentation, what I would like to tell you:
sadly I cannot ask you live, but please write in the comments, or write to me
on LinkedIn: what is your favorite AWS service?
Mine is actually a bunch of services which
build the basement of the idea that AWS
is basically Lego, and you can take any service and
make it yours. The foundation of
any expansion and interaction with AWS services
is Lambda, EventBridge, SNS, and CloudWatch Logs.
This is the amazing four, I would say. And Lambda
is so amazingly integrated with the different things and
just lets you do the thing that you want, if you just spend some
time coding, or asking ChatGPT, of course,
and you can automate this stuff. With EventBridge
you can have communication with SQS or SNS.
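As a sketch of that glue: SES can publish bounce notifications to an SNS topic, and SNS can invoke a Lambda. The handler below parses the documented SES and SNS event shapes; treat it as a starting point, not the project's actual code.

```python
import json

# Hedged sketch of Lambda + SNS glue: SES publishes a bounce notification
# to SNS, and SNS invokes this Lambda. The event shapes follow the
# documented SES/SNS formats; this is illustrative, not production code.

def extract_bounced_addresses(event: dict) -> list[str]:
    """Pull permanently bounced recipient addresses out of an SNS event."""
    addresses = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("notificationType") != "Bounce":
            continue
        bounce = message["bounce"]
        if bounce.get("bounceType") == "Permanent":
            for recipient in bounce.get("bouncedRecipients", []):
                addresses.append(recipient["emailAddress"])
    return addresses

def handler(event, context):
    # Here you could mark the recipients as broken in your database,
    # as described later in the talk.
    return {"bounced": extract_bounced_addresses(event)}
```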
And this is why SES
does not create much headache for us, because we can react to
the bounce events, we can react to the complaints, we can process the logging,
and we can gather the data out of it. It is a
simple service, but by taking Lambda and some other things,
and not only with SES: this works
with every AWS service. You can
create amazing things; you just have to understand what
you want and what you need. And if you want to go even
deeper, you can build Step Functions. I absolutely love Step Functions, because
they cover lots of amazing things for you.
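As an illustration, a minimal Step Functions definition (Amazon States Language) can be built and serialized like this; the state name and Lambda ARN are placeholders, not from the talk.

```python
import json

# Hedged sketch: a minimal Step Functions state machine definition (ASL)
# built as a Python dict. The state name and ARN are placeholders; the
# shape follows the documented Amazon States Language.

def report_state_machine(lambda_arn: str) -> str:
    """Return an ASL definition: run a report Lambda with retries, then stop."""
    definition = {
        "Comment": "Hypothetical weekly report flow",
        "StartAt": "BuildReport",
        "States": {
            "BuildReport": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
                "End": True,
            }
        },
    }
    return json.dumps(definition)

# The JSON string could then be registered, e.g. (untested):
#   boto3.client("stepfunctions").create_state_machine(
#       name="weekly-report", definition=report_state_machine(arn), roleArn=role)
```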
And what is
also important: you can use S3 for caching, or DynamoDB for
storage.
AWS gives you lots of opportunities that you just have to come
and use. So let's go forward. And yeah,
as I have already mentioned, the AWS Fault Injection
Service allows you to break parts of the setup
granularly: you can break the networking, you can kill the database,
if you have microservices. And actually,
I forgot to mention that MediaContact Plus has
three tightly coupled containers running on one node, and
those microservices take care of different parts of the application.
And for me it was essential to understand
how it would behave if one of the microservices were
taken out. And the Fault Injection Service
actually saves lots of time: you don't have to automate those things
yourself, and it's a game changer for observability coverage,
because you don't have to wait for the event to happen. You
can simulate the event, do the analysis, tell the
outcome to your developers, and understand for yourself how it would
behave and what is missing. You can just do it, and it
can automate some routine checks or something else.
And yes, you can expand it. You can write Systems Manager
pipelines that come to your host or to
your container and do some different weird things.
And most importantly, it rolls back all the changes that it makes.
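Starting an experiment from automation is a small call; a hedged sketch, where the template ID and tag are placeholders you would take from your own FIS experiment template:

```python
# Hedged sketch: kicking off an AWS FIS experiment from automation, e.g.
# as a routine resilience check. Template ID and tag are placeholders.

def experiment_request(template_id: str, run_label: str) -> dict:
    """Build the parameters for fis.start_experiment."""
    return {
        "experimentTemplateId": template_id,
        "tags": {"run": run_label},
    }

# Actual call (untested, requires an existing template and IAM permissions):
#   import boto3
#   boto3.client("fis").start_experiment(**experiment_request("EXT-example-id", "weekly"))
```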
And it's even cooler now, because it's
part of the Resilience Hub. And the Resilience Hub
not only allows you to break things: if you are just
starting, or if you are just not familiar, it gives you advice
on how you can proceed. So together with these things, you can
do the huge piece of work which is basically called chaos engineering,
and you can get ready with this service. And then, what
I also really like, and what is really missed if you
run a database somewhere else, wherever you
run it: I like RDS so much because it has Performance
Insights. And it's an absolute lifesaver for performance
troubleshooting, because you can just go to Performance Insights and see whether you
have problems related to your instance,
or problems related to your application,
for example if an SQL query was not built properly
by your developer. It's amazingly simple to set up:
with Postgres it's not an issue, but for MySQL
you have to have a certain size of RDS instance, I guess
bigger than medium. And in our case,
unfortunately, we had an RDS performance issue,
and Performance Insights showed us the exact
point of what was not running properly on the application side,
and why the current AWS setup was not that
good. And with Performance Insights, basically,
you just press the button, you modify the RDS instance, you wait
just a few minutes, and then you have an amazing source
that can help you resolve all the potential issues.
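That "press the button" step corresponds to a single ModifyDBInstance call; a sketch, where the instance identifier is a placeholder and the helper simplifies the allowed retention values to the two common ones (7 days free tier, 731 days long term):

```python
# Hedged sketch: enabling Performance Insights is one ModifyDBInstance
# call. The instance identifier is a placeholder; this helper also
# simplifies the allowed retention values to the two common ones.

def pi_modify_params(instance_id: str, retention_days: int = 7) -> dict:
    """Build ModifyDBInstance parameters that switch Performance Insights on."""
    if retention_days not in (7, 731):
        raise ValueError("use 7 (free tier) or 731 (long-term) days")
    return {
        "DBInstanceIdentifier": instance_id,
        "EnablePerformanceInsights": True,
        "PerformanceInsightsRetentionPeriod": retention_days,
        "ApplyImmediately": True,
    }

# Actual call (untested):
#   import boto3
#   boto3.client("rds").modify_db_instance(**pi_modify_params("my-db-instance"))
```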
The coverage is simply amazing. And most importantly, it's cheap:
out of the box you have seven days for free, but, for example,
for just three or four dollars you will have three months of this service,
and you will have a big backbone of
data gathered from your real-time running application with
real users, and you can do long-term observability
if everything comes in. And from my perspective,
Performance Insights is even more valuable than the RDS
performance monitoring itself. It's
so amazing, and you have to have it activated. Yeah,
and going back to the Swiss army knife:
this is an example of the automation that we have
coming from AWS SES. SES does some action,
email sending or whatever, and it makes a record in the logs.
We have an EventBridge rule that, for example,
once a week runs a Lambda that does a
query over the CloudWatch logs and
creates a report that then goes to our team. This way you
eliminate the requirement that your colleagues must be AWS-proficient
to come to SES and so on: using AWS services, you can build a thing
that is familiar to them, and then they are happy that
they receive data they understand, and they don't have to manage
anything. And also, going a bit beyond
this: since you can grab any data from SES,
I built different dashboards,
and my product owner just comes to his account,
not to his, to our account, sorry, the one that runs the application. He comes to the
shared dashboard and he sees the sendings: from which customer,
to where, and whether we have had bounces or responses. And
also because of this, there are a few more Lambdas: we can mark
some recipients as broken, and whatever else.
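A sketch of such a report Lambda, assuming a log group name and a bounce-filtering query (both illustrative); Logs Insights queries are asynchronous, so the code polls for completion:

```python
import time

# Hedged sketch: an EventBridge rule invokes this weekly; it runs a
# CloudWatch Logs Insights query over the SES event logs and returns
# the rows for the report. Log group and query fields are assumptions.

QUERY = """fields @timestamp, @message
| filter @message like /Bounce/
| sort @timestamp desc
| limit 100"""

def run_insights_query(logs_client, log_group: str, start: int, end: int) -> dict:
    """Start a Logs Insights query and poll until it finishes."""
    qid = logs_client.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=QUERY
    )["queryId"]
    while True:
        resp = logs_client.get_query_results(queryId=qid)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(1)  # queries run asynchronously
```

In the real Lambda, `logs_client` would be `boto3.client("logs")`, and the returned rows would be formatted into the report that goes to the team.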
We can also interact with the database to give our
application a better response. And, for example,
if it's hard to re-engineer the application and you need
some specific function that you have in AWS: this is
an amazing example of how you, the cloud engineer,
can expand some specific backend functionality
without making any changes to the application, and you will have the thing
resolved. This is simply amazing, and this is why you have to
use Lambdas together with the rest of the services on
AWS. So coming back to MediaContact Plus itself,
I would like to talk about running the stuff, and what is
most important about running on AWS is
to understand whether you are taking the proper service for
doing the thing that you need. In this case we were running the application:
it was already dockerized, microservices,
et cetera. And we had a few options. One is actually
just running EC2, which for most people
could be scary because of the management, but it's actually not that
bad. You can use ECS, and I really
like ECS: it just gets the thing done.
But unfortunately it was not the case here, because of course this is
a new service, it's new for the team,
and it just runs differently from Kubernetes,
for example. This is something new, and it is not like
the Docker that you run yourself. But I highly recommend using
ECS on Fargate: it must be one of your first
considerations if you want to run an application on
AWS. Then of course you have EKS, but you have to ask
yourself if you really want EKS, because it's
real-life Kubernetes, and from my perspective it's built a bit
on the side at AWS. I would say that ECS is
AWS-native, but EKS is not
that well integrated, and you have to understand that.
But for example, ECS might be amazing for your small
application or for some small business needs; meanwhile,
EKS and Karpenter are those amazing things
for big scale, et cetera. And if you
go even bigger, you can have ROSA, the
OpenShift on AWS. But yeah,
that is the real enterprise thing, and I don't think that you
will need it. Still, I would prefer ROSA over EKS,
because it runs so transparently and it works really
well, and you don't have much vendor lock-in to
some specific cloud provider like Azure,
AWS, GCP, whatever. And that is what you should also
take into consideration. It was important for
us to avoid vendor lock-in; vendor
lock-in is the thing where you don't want to be
tied to something. And for us what's important is that we
can at any time take the application out of AWS and put it
in our data center, for example, or relocate it
to another cloud, just in case.
The target was to be as flexible as we can, and we wanted to have control over things. That's why I was not changing the underlying approach that we had already had for a few years. And when you are taking a service into consideration, you have to understand the balance. AWS is doing an amazing job and most of the services are cool; you just need to understand how to use them properly. This is why you need to work with the services and know the specific details.
And since we decided to stick with EC2: as I mentioned, ECS was a new technology the team wanted to avoid, and EKS was simply redundant. This project was not that big and it was somewhat limited on budget. The EC2 instances just run well-known Docker with three microservices on the host, and they fit amazingly well on one instance, but I changed the approach. Initially we had one node that ran everything together, but we had different load on specific microservices and couldn't scale them independently; if we scaled, we got the same node with the same things. So I split them up a bit. We still have EC2 at the base, and that's not that scary, as I will show you. But first of all, let's see why budget was a heavy consideration when choosing the baseline for our application. As you can see, I take the bare price of the EC2 instances that run the application as the 100% baseline.
Savings plans are amazing: you are not stuck with a reservation, and if you have an organization running most of your applications on t3, m6i, c5 or whatever fits your needs (t3/t3a is the most common case), you can purchase savings plans; if a plan is underused, you can share it or just shift it to another account. What's amazing is that you already get about 40% discount, and the only way to do better is to switch to spot instances. Some people are afraid of spot, but you shouldn't be, and I will show you why a few slides later. Fargate is a cool service, and I really like it, because you usually don't have the operational overhead; it just runs your stuff.
But please get ready to cover the bill: if I had taken Fargate for our application, my product owner would have killed me, because the price is about 473% of the baseline. This is the balance I mentioned: it's extremely important to understand how your effort and the price of the service relate. Of course you have almost no effort, since it runs everything for you, but at an extreme cost. If we instead take EKS with EC2 workers, you can still apply savings-plan discounts to compensate for the pricing of the EKS cluster, but you have the overhead. And you should ask yourself: do I really need a complete Kubernetes for this business purpose? Maybe it can be done with ECS, or even more simply.
And of course I have some resources for you. The Kubernetes instance calculator is amazing: you can tell it that you are using EKS and it will advise which instances to pick for your cluster and how your applications, with certain limits, fit onto it. Another amazing thing is the Fargate pricing calculator. The general AWS calculator is a bit confusing, but with the Fargate pricing calculator it's just a few clicks over the most essential information and you get the price, if you really don't want to manage anything and just want things running. But in our case we decided to stay with EC2.
Some of you might say that we are crazy, but it's a working approach: we have built a system that does our thing with almost no difference in price. We get amazing performance, we don't pay as much, and we don't have real operational overhead, because we were simply well prepared for it. As I mentioned, it was important for us to keep control of the host, because we have specific data-processing guidelines and we have to stay compliant. We really trust AWS, and they have very high security standards, but we still want to know what we are running and where it is running.
Another important consideration for me: what would the benefit be if I moved everything to ECS, and how much time would it take? For every re-architecting action you should have a significant benefit to justify it. We also have tightly coupled microservices; they have to stay alive together, and the load has to scale properly. I wanted to avoid cases where, for example, we hit some Kubernetes limit and a pod gets evicted, or the host is over-provisioned. You also have to understand the instance type that you use. In our case it was a very CPU-light application, but it was heavy on memory usage and on input/output and networking. So for me it was crucial to use instances with better network performance and more memory, rather than CPUs that we would simply not use.
People say it's hard because you are working with a plain host, but thankfully AWS thought about us even here. And if it's a small thing that can be done in a few hours, why spend the extra money that could go to something else and improve other aspects of your project? Now, talking about the workloads.
One of the first things that comes to mind when choosing instances is Graviton. Graviton instances are really great, an amazing application of ARM technology, and they give you good performance for good value. But the problem with Graviton is the one on this slide: yes, even though it's a containerized application, you have to take into account that your dependencies must be able to run on ARM, and that you are not using libraries that rely, for example, on AVX instructions; the absence of AVX can cause a huge performance drop for you. You also need extra tooling like Docker buildx or QEMU, and if you already have many other applications with heavy dependencies on your Jenkins builder, you probably won't want to change things. This is why I think Graviton should be considered from the beginning; if you are already in the middle of the journey, the transition is a bit hard. I have also heard of cases with some specific integrations and some specific monitoring tools, et cetera. So yes, you can save money by running on Graviton, but you should understand whether the effort of making your application run on Graviton could instead be covered by something else on AWS. This is exactly what our example shows.
I just took another approach and saved some money. But coming back to Graviton: maybe you have to swap the base image, and if you use Amazon Linux or something else, you just have to rebuild it; you have to do long-term testing; and as I mentioned, some things may break. But if nothing blocks you, you should try Graviton instances, and ideally plan your architecture around them from the beginning. Another thing that was a bit hard for me was motivating colleagues to change the pipeline. Despite the financial benefits, you still have to take care of building, testing, et cetera. Just picking a Graviton instance doesn't give you the benefits out of the box, and it will not do the things for you.
So what did we do? I just redelegated things. Spot Fleet is an amazing thing when you are running microservices and want to be scalable, and if you are afraid of an instance being taken out of service, you shouldn't be. With this setup you can have a main on-demand instance, covered by a savings plan, that keeps your core running, and we have different hosts in separate auto scaling groups attached to our application load balancer. You can always guarantee availability through that on-demand base: one instance will always serve the connections for you, and the rest of the capacity during business hours can be covered with spot instances, which are often 60, 70, 80% cheaper than on-demand.
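The on-demand-base-plus-spot layout described above can be sketched as an Auto Scaling group request. This is a minimal boto3-style sketch, not our exact setup; the group name, launch template ID, subnets, and instance types are all hypothetical placeholders.

```python
# Sketch of an Auto Scaling group that keeps one on-demand base instance
# (eligible for a savings plan) and fills the rest of the capacity with spot.
def mixed_instances_asg_params(asg_name, launch_template_id, subnet_ids):
    """Build the request body for autoscaling.create_auto_scaling_group(**params)."""
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": 1,
        "MaxSize": 6,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                # Several similar types give the spot allocator room to choose.
                "Overrides": [
                    {"InstanceType": t} for t in ("m6i.large", "m5.large", "m5a.large")
                ],
            },
            "InstancesDistribution": {
                # One always-on on-demand instance serves as the guaranteed base...
                "OnDemandBaseCapacity": 1,
                # ...and everything above the base runs on spot.
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }

params = mixed_instances_asg_params(
    "mcp-web", "lt-0123456789abcdef0", ["subnet-aaa111", "subnet-bbb222"]
)
# A real call would be: boto3.client("autoscaling").create_auto_scaling_group(**params)
```

The builder only assembles the request, so you can inspect or test it before handing it to boto3.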
Because of this you can even deliver a better experience to your customers: your application runs on faster hardware, but your bill is smaller. For me it was also important to slice the groups because of the inconsistent load on the microservices, so we have three groups serving things. You must give Spot Fleet a try if you run stuff on AWS. As I mentioned, compared with on-demand you can save over 60%, and you can additionally put a savings plan on your on-demand instances; that is already big savings you can spend, for example, on security services or on increasing observability coverage.
It's simply amazing that such a discount exists, and you must not be afraid of the termination factor. Why? Because you get a notice from AWS, and you are backed from two sides: first, your application is still running on the on-demand instances; second, a message comes from AWS: "Hey, we will take this instance from you." So you can be ready; you can panic, or you can do nothing. In our case, as you can see on this slide, the application interacts with Secrets Manager (that's just about security) and with Aurora; we got rid of RDS and switched to Aurora later on, and the application knows where it runs. So important tasks, like database migrations or batch jobs in progress, go to the main on-demand EC2 instance, which is always there, while interruptible or short-lived tasks go to the spots. The main app runs on the main instance, the rest of the load is handled by the spot nodes, and the application simply knows when it is on spot. We have a pool, the nodes are interconnected, and the two-minute notice is usually enough to finish your work or save it into some persistent state from which the data processing can continue.
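The two-minute notice mentioned above is exposed on the instance metadata service. A minimal sketch of how a node can detect it (using IMDSv2, which requires a session token); the sample JSON body is illustrative, not a real event:

```python
import json
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token():
    """Fetch an IMDSv2 session token (only resolvable on an EC2 instance)."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def parse_instance_action(body):
    """The interruption warning is JSON served at .../meta-data/spot/instance-action."""
    doc = json.loads(body)
    return doc["action"], doc["time"]

# On the instance you would poll every few seconds:
#   tok = imds_token()
#   req = urllib.request.Request(f"{IMDS}/meta-data/spot/instance-action",
#                                headers={"X-aws-ec2-metadata-token": tok})
#   urllib.request.urlopen(req, timeout=2)  # HTTP 404 means no interruption pending
# A 200 response body parses like this (illustrative payload):
action, when = parse_instance_action(
    '{"action": "terminate", "time": "2024-09-01T12:00:00Z"}'
)
```

On a 200 response you have roughly two minutes to drain work or checkpoint state before the instance is reclaimed.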
So this is how you can stop being afraid of using spot instances. What's next? Since we are running on EC2, node provisioning time is crucial here, and since we are using auto scaling groups and a few other small things, we are covered from several angles. First of all, you can use scheduled scaling. As I mentioned, it's important to know your users: you can start provisioning extra nodes before the peak business hours, and when the load declines you can scale them down again, so you are always prepared. You can also use performance tracking, for example CloudWatch metrics, to scale your application. What is also amazing on the auto scaling group side is the warm pool. With a warm pool you can pre-bake instances that are kept out of service but are always ready to come and help you; I don't know why, but it's not that widely used.
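The scheduled scaling and warm pool just described can be sketched as request builders. This is a hedged sketch with hypothetical group names and schedule times; the real values depend on your traffic pattern.

```python
def scheduled_actions(asg_name):
    """Requests for autoscaling.put_scheduled_update_group_action(**action)."""
    return [
        {   # extra nodes before peak business hours (cron expression, UTC)
            "AutoScalingGroupName": asg_name,
            "ScheduledActionName": "business-hours-scale-out",
            "Recurrence": "0 7 * * 1-5",
            "DesiredCapacity": 4,
        },
        {   # scale back down in the evening when the load declines
            "AutoScalingGroupName": asg_name,
            "ScheduledActionName": "evening-scale-in",
            "Recurrence": "0 19 * * 1-5",
            "DesiredCapacity": 1,
        },
    ]

def warm_pool(asg_name):
    """Request for autoscaling.put_warm_pool(**pool): pre-baked, stopped spares."""
    return {
        "AutoScalingGroupName": asg_name,
        "PoolState": "Stopped",  # instances wait stopped and boot fast when needed
        "MinSize": 1,
    }

actions = scheduled_actions("mcp-web")
pool = warm_pool("mcp-web")
```

Stopped warm-pool instances cost only their EBS storage while waiting, which is why they pair well with slow-booting nodes.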
And in our case, running the application on EC2, the lifesaver for me was EC2 Image Builder, plus caching shared data on EFS. That helps because you don't have to keep the same things, for example the same picture, on every instance; instead you say "come here and look at the picture," and that is what you can do with EFS. But be careful: EFS sometimes has very strange performance, because it's a network-attached drive. If it works for you, use it and don't duplicate things. Coming back to EC2 Image Builder: initially an application node took about 500 seconds to boot up and become healthy in service. It installs the updates, fetches the images from ECR, and just comes up running.
So the first idea was to always provide a fresh image for our system. Of course you could swap the AMIs by hand, but EC2 Image Builder just rebuilds the image, and it interacts with our change management: I open a change request, my team lead approves it, the callback comes back, and I get a notification that the image was rebuilt; then we provision new nodes from the fresh image. So we already save time because we have a good image. What's next? The idea was to also cache the Docker images on the host, not only through Docker itself, because out of the box you start from scratch with no caching. For example, if my developer changes something in one microservice but the rest are unchanged, we just refresh that one service, which already saves half of the time. I also went through the application startup time: since the app warms its caches from the database and other sources, EFS stores the hot data, and the response time for the application is now even better.
You may have seen that the ELB didn't like this at first: such a big startup time was, out of the box, exceeding the default health check grace period.
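One way to deal with a grace period that is too short for a slow-booting node is to raise it on the Auto Scaling group. A minimal sketch; the group name and timing are hypothetical:

```python
def health_check_params(asg_name, startup_seconds):
    """Request for autoscaling.update_auto_scaling_group(**params): give
    slow-booting nodes time before ELB health checks can mark them unhealthy."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": startup_seconds,
    }

# A node originally needed ~500 s to boot, so a 600 s grace period leaves margin.
params = health_check_params("mcp-web", 600)
# Real call: boto3.client("autoscaling").update_auto_scaling_group(**params)
```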
I would also like to mention something some people don't know about: there is an amazing program from Red Hat. If you already have Red Hat running on premises and you have Red Hat subscriptions, you can bring those subscriptions to AWS and get access to bring-your-own-license images. It's extremely easy to set up: you connect your AWS accounts to Red Hat, you can use the provided CloudFormation StackSets to grant everything the Red Hat integration needs, and you just go to the Red Hat console, add the accounts, and you have it; as a bonus, you get the management tooling. What you have to understand is that some tweaking is needed. If you are a Red Hat administrator, you know what to do, but out of the box the bring-your-own-license images are not that good for intensive auto scaling, so you have to adjust some configuration related to system activation and similar things. But if you have licenses, please use Red Hat Cloud Access: you will have your Red Hat workloads on AWS without being billed hourly on top of your subscription, and some changes are upcoming if you want to use RHEL for these things. Yeah, and actually
regarding building the image, this is the scheme of what we have. As I said before, we have a few accounts, for change management and for resource management. So what happens? We have a few trigger sources: we can process the release of a new version in the production branch; we can react to Amazon Inspector findings (for example, a critical vulnerability will even skip the change process and be patched immediately if the testing succeeds); or Red Hat releases a new image. EventBridge picks up the messages coming from these services, and then the automation, built as a state machine and Lambdas, comes into the game.
It builds a new EC2 image with a fresh base image and the fresh application, with all the updates installed; it updates the dependencies, then provisions the application and runs the automated testing, which is also done by Lambdas. And of course it's tightly coupled with SNS and SQS. Basically, with these services I have built the whole pipeline that I would otherwise run by hand. When testing succeeds, I get a finished image that is ready to use and a callback from the automation: "Hey, everything is good, we are ready to ship." Then it creates the change request; I'm not involved. Meanwhile nothing breaks, and it hasn't broken in the last year; it just works. Then the approval flow takes over, the change ticket is processed, and a message goes to a specific endpoint in the separate account that tells the auto scaling group: "This is your new image." The launch template is changed by the Lambdas (boto3 is amazing), the new version is live, and it even takes the business hours into account.
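The final step, pointing the launch template at the freshly built AMI and rolling the group, can be sketched in two boto3-style requests. The IDs and thresholds below are hypothetical, not our production values:

```python
def new_template_version(template_id, ami_id):
    """Request for ec2.create_launch_template_version(**params): point the
    template at the freshly built AMI, inheriting the rest from $Latest."""
    return {
        "LaunchTemplateId": template_id,
        "SourceVersion": "$Latest",
        "LaunchTemplateData": {"ImageId": ami_id},
    }

def rolling_refresh(asg_name):
    """Request for autoscaling.start_instance_refresh(**params): roll the group
    onto the new image while most of the capacity keeps serving traffic."""
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {"MinHealthyPercentage": 90, "InstanceWarmup": 120},
    }

version = new_template_version("lt-0123456789abcdef0", "ami-0abc1234def567890")
refresh = rolling_refresh("mcp-web")
```

A Lambda behind the change-approval endpoint can issue exactly these two calls, which is what makes the hand-off from ticket to rollout fully automatic.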
And basically this is what we would lose if we just used some other way of running things. Since, as I mentioned, we have an amazing node provisioning time, we are very fast at getting the application up and running, and very fast at reacting to changes or releasing new versions; it's just like a river: things come in, they flow, and they result in the outcome. What I would also like to say about the testing process: I test the application on the host, both internally and externally, just to ensure we won't cause any downtime with the update.
Another thing I would like to talk about today: I think you like surprises, but look at what the creation wizard for an RDS database gives you out of the box. You get a gp2 volume, and this is not what you want to have. This is why you may get a very big surprise, which looks like this: just look at this burst balance, look at those huge wait times on the database. This is crazy, and basically if you see this, it means your application is not working because of the burst balance. But you might
ask: okay, but what can we do about this? It's pretty easy, and thanks to AWS again: they introduced gp3, an amazing volume type that gives you consistent performance without those limits, and it's even cheaper. You know what IOPS you have; you know what you get. What's important about gp2 is that it has a certain amount of IOPS and throughput per gigabyte, and to reach decent performance gp2 must effectively be striped, which means you need roughly a 100-gigabyte volume. What is essential about gp3 is that it works amazingly well on small volumes, like a 10-20 gigabyte system drive, and, most importantly, it's cheaper, and it just runs.
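Moving an existing RDS instance from gp2 to gp3 is a single modify call. A minimal sketch; the instance identifier is hypothetical, and whether to apply immediately or wait for the maintenance window is your call:

```python
def gp3_migration(db_instance_id):
    """Request for rds.modify_db_instance(**params): move off burstable gp2
    onto gp3, which has a consistent IOPS baseline and no burst-credit cliff."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "StorageType": "gp3",
        "ApplyImmediately": True,  # set False to defer to the maintenance window
    }

params = gp3_migration("mcp-db")
# Real call: boto3.client("rds").modify_db_instance(**params)
```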
If you need a small volume for a database instance: io1 and io2 are good, they give amazing performance, but they are expensive, so first check whether gp3 can cover your needs. To be fair to gp2, it might give you better performance when we are talking about terabytes, but that's not a reason to use it for your RDS, because you want your RDS alive. The reason gp2 is bad is that it causes I/O waits, and the volume becomes effectively inaccessible when you run out of burst credits; if you don't monitor it, it can be a huge surprise for you as the administrator why the application is down. So the first things to check are whether your RDS volume is encrypted, because encryption is also not enabled by default, and whether gp2 is in use. Please just use gp3 and you will be happy in most cases. Going forward, as I mentioned: whether you use Aurora or RDS, it's important to watch all the credits, for CPU and for storage, just to understand whether you are covering the needs of your application as it runs.
What is also important from my observability perspective is the metric called DBLoad. If you see load that is not SQL load, it means the load is caused not by your application; something is going wrong on the host, mostly when you are running out of storage, low on RAM, or the CPU is simply not enough to run your queries. Swap usage is actually a bomb that comes together with the gp2 problem: if you are low on memory and swap is active on a gp2 volume, you are burning the credits that your application needs, a double burn of your credits, and it's better to scale up a bit and give the instance more memory rather than have the database using swap. A bit of swap, like 25-100 megabytes, can be fine, but if we are talking about gigabytes, it's the road to catastrophe.
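The metrics just mentioned (burst balance, swap usage, latency) all live in CloudWatch under the AWS/RDS namespace. A minimal sketch of a query builder; the instance name is hypothetical:

```python
import datetime

def rds_metric_query(db_instance_id, metric_name, minutes=60):
    """Request for cloudwatch.get_metric_statistics(**query) on one RDS instance."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,  # e.g. "BurstBalance", "SwapUsage", "ReadLatency"
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,              # 5-minute datapoints
        "Statistics": ["Average"],
    }

# Alarm-worthy signals from this section: BurstBalance trending toward zero on
# gp2, and SwapUsage growing into the gigabytes.
query = rds_metric_query("mcp-db", "BurstBalance")
```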
Latencies are also important, because the response time directly affects your application's performance, and with latency comes the general database performance. The number of connections is essential if you are working with Lambdas, because every Lambda invocation opens a new connection. If you have a steady application that opens a connection and keeps the session, this is not an issue, but with Lambdas you have to keep an eye on it. So going forward: we had already changed things. We changed the volume type for the RDS, we got rid of the swapping, we have a bigger instance, but we still have the waits and we don't know what to do. There are a few options, of course:
one of the things AWS promotes for improving performance is the use of read replicas, but that requires application re-engineering; your application must be ready for it. For example, with Spring Boot, if you use the native connectors, you can mark data as read-only in your database schema, but then you have to go to your developers and say "do it," and they will say "no, we have no budget." We could also upscale the RDS, but we cannot keep going up to infinity; you can find a good instance size for you, but if it's not enough, what then? Some people may be confused by RDS Proxy. RDS Proxy is a good thing for Lambdas because it does connection pooling, but it will not improve read/write performance on the RDS instance. Or call in help: if you have a good technical account manager and a good solutions architect on the account, maybe you can ask them. But yeah, it's better to drink beers together.
So what are the solutions? As I mentioned, a read replica means re-engineering. RDS Proxy is amazing, but it will not help you here; in this case it helps only by reducing the failover time, by up to 79%. I really tested that: it cut failover from five minutes to one minute. But you have to pay for RDS Proxy, and if you look outside AWS there are proxies like HAProxy. Why pay serious money for a proxy if, for that money, you could re-engineer your application, and how can you be sure it really does the right thing? And then there is the amazing thing AWS has: Aurora (and DynamoDB). While DynamoDB just stores things for you, Aurora resolves the biggest issue of every database: that it is static and not scalable. In our case the application side was not the issue; we were bottlenecked by the database. And Aurora brings another very cool thing.
But just before I switch to Aurora, you may say: okay, I don't want to use Aurora, I still want to stick with RDS, but which RDS instance should I pick? I have done a long-running evaluation of the performance data I was gathering, and I would say: if you want an RDS instance, take a Graviton instance. RDS is the best application of Graviton that you can have. The t4g and m6g are pretty comparable by RAM and vCPU count, but the m6g has a slightly newer Graviton and is more pricey. And the question for me was, for example,
what to do with this. As you can see, t4g and m6g give amazing performance, and you don't want a t3 for an RDS instance anymore, because both of them are cheaper and run faster; just forget about the t3. And if you are watching your burst credits and they stay amazingly stable, you don't overuse them, you know the patterns of your applications, and your number of active users doesn't cause issues, you can save around $50-70 by using t4g instead of m6g, despite m6g being the instance class AWS recommends for production. You can see some additional metrics here; I will just hold on for a second. Sometimes the performance of Graviton instances, at a lower price, is twice that of comparable x86 instances. But as I mentioned, you can take the m6g as a starting point for your RDS, or the t4g.large if your usage fits.
But all of this does not resolve the thing I dislike most about RDS: you have periods when the database is idling, yet you keep a fully burning database sized and ready for lots of customers. It's basically like keeping the heating on in an apartment where nobody has lived for months; it's just a waste of money. And this is what Aurora resolves for us, absolutely transparently. Thankfully we are talking about Aurora Serverless v2 here. I have some experience with Aurora Serverless v1; thankfully it is already on its way to legacy. It ran a few outdated engines, and you might have had to change your schema when migrating to the fresher Serverless v2. V2 had a matching engine for the RDS we already ran, so all we had to do was take a snapshot and change the endpoint of
the database. And you want to use a Route 53 hosted zone for the database endpoint, because you don't want to hardcode the raw RDS DNS endpoint everywhere; with a Route 53 private zone you can fix that. If you are migrating and trying out different databases, you can create something like mydatabase.conf42.com, and then you are set: when we create a new instance, the DNS record changes, but you don't have to update your DB viewer connection, your docker-compose files, or the environment variables used to access it. This is just a free tip, not specific to Aurora; it's about databases on AWS in general. And
what Aurora resolves is that it eliminates the need for application re-engineering. You can already have the main writer instance and a read replica, and Aurora will do the load balancing for you. This is simply amazing. In our case the application receives only a single endpoint. The application comes to me and says, "Dima, would you give me this information?" and even if I'm busy, I just delegate it to somebody working with me and hand the result back; the application never notices, because Aurora does this itself. You don't have to change your schema or anything; you just scale up and down when you need to. In our case it turned out to give the same performance at 35% less cost, and I'm extremely happy with this, because I haven't used any developer time to change it. The migration was absolutely smooth; Aurora offers lots of MySQL and PostgreSQL engines, whatever you would like, and you can even start building natively on Aurora's own engine.
Of course, if you go to the calculator and put in the peak capacity, you get some crazy values, way bigger than your RDS bill, but you never run at peak capacity all the time, and this is exactly what Aurora exploits. For example, here is the price baseline for the m6g, and here is the chart of a real workday of our application on Aurora. As you can see, in some periods, to deliver better performance than we had with RDS, it scales up, but those peaks are compensated by the idle time. What is important about Aurora scaling: if you scale pretty aggressively, for example starting at half a capacity unit and scaling up to four units or more, you should not expect full performance immediately; Aurora takes some time to wake up, so you probably won't hit the same rate of performance at once. But at the peaks it's absolutely comparable to what you get with RDS, and you don't burn the money when the capacity is not used.
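The scaling range just described (half a capacity unit at idle, a few units at peak) is exactly what you declare on a Serverless v2 cluster. A minimal sketch; the cluster name and engine choice are hypothetical, and the cluster additionally needs instances of class db.serverless:

```python
def aurora_serverless_v2_cluster(cluster_id):
    """Request for rds.create_db_cluster(**params) with a Serverless v2
    capacity range: half an ACU when idling, four ACUs at peak."""
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "aurora-postgresql",
        "ServerlessV2ScalingConfiguration": {
            "MinCapacity": 0.5,  # idle floor: barely burns money overnight
            "MaxCapacity": 4.0,  # peak ceiling; scale-up is fast but not instant
        },
    }

params = aurora_serverless_v2_cluster("mcp-aurora")
# Real call: boto3.client("rds").create_db_cluster(**params)
```

The min/max pair is the knob behind the cost chart above: billing follows the measured ACUs between these bounds rather than a fixed instance size.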
And there is an amazing example by my colleague Joe Ho: he did a big observability study over a few months with Aurora Serverless cost calculations. Please jump in; you can use his example in a presentation for your management and colleagues. He did a really big job, and I can confirm from my own experience that the data is valid: with the right approach, Aurora can be amazingly efficient.
So, thankfully and sadly, we are coming to the end. Throughout the whole process it's important to communicate with your team; everyone must understand the decisions and why they were made in certain ways, and you must know the details. We can talk about a database on AWS, on Azure, on Oracle Cloud, and it might look the same, but only from the surface, and a good cloud engineer should know the difference. You might have the same task, for example to have the application running, but it can be done in different ways. The same with the details: gp2 and gp3 are both EBS volume types, but they behave differently, just like Aurora and RDS. Details matter, so be creative. You are the engineer, you are the artist, you are the architect. Imagine if the whole world were built with panel houses. No: we have amazing buildings like the Stephansdom, the Eiffel Tower, Big Ben, because people are creative, and this is part of our work too. We are not game designers, but we must be creative to deliver amazing solutions, whatever they are. And with AWS you can have everything covered and done; this is amazing, and I'm really excited to work with such services; it's
really cool. Thank you very much for your attention. I hope you have enjoyed the session. Please feel free to contact me on LinkedIn; I'm always happy to have a discussion or give you a tip, I can dive into your case, and maybe we can discuss some specifics. And if you have recommendations for me, I will be thankful to hear your opinion on what could be done better. On the left there is a QR code for my LinkedIn, and please also check out the rest of the exciting colleagues participating in this conference. Thank you very much to Mark for the invitation; it's a big pleasure for me to be here. Please also write me an email or visit my blog, which I'm about to launch very soon. To conclude, thank you very much to the AWS community in the DACH region for all the support; we are hosting an amazing event, the Community Day DACH, which will take place in September this year. We are opening the registration very soon, and I will be happy to see you there; then we can discuss the workloads in person. And if you are looking for the user group in Vienna, please check out our meetup page. This is all for today. Thank you very much, all the best, best of luck in your AWS deployments, and see you later in the clouds.