Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining me at DevOps 2022.
My name is Leonard. I am a developer relations engineer at Google.
In my role, I help developers improve the observability of their applications.
Today I would like to talk to you about logs anonymization
and how it can be implemented.
Logs anonymization is one of the tasks in an implementation of data compliance. As you know, data should be compliant with the standards and regulations existing in the industry, such as GDPR, HIPAA, PCI DSS and many others. Implementing this compliance can be very expensive, both from the business and the engineering perspective. It requires changes to the application code and modifications to the environments and engineering processes, and it can be especially complex for applications architected as multi-tier distributed service meshes or deployed across multiple geographic locations, which can be subject to different standards and regulations.
Compliance addresses all data that the application generates, including data input by the user, data generated and maintained by the application, and also data created by other sources such as CI/CD workflows. Data in the logs is also covered by the compliance regulations and has to be compliant. This can be especially tricky because, in many implementations of data compliance, logs are seen as a means of satisfying various compliance requirements and not as a target of those compliance regulations.
The implementations of compliance, also referred to as compliance controls, can be roughly divided into three types. Preventive controls implement requirements that do not allow certain operations on or uses of the data. When you think about this type of control, you can think about identity and access management as one implementation of it. Another type is detective controls. They essentially capture information about the activities that take place, which allows you to detect and alert on any violations of compliance, or can be used in postmortem investigations related to compliance validation. Audit logs, and often other application logs, are used as part of the implementations of this control. The last type of controls, corrective controls, are used to implement the reaction to detected compliance violations.
Let's look at how these controls are implemented for the data stored in the logs. For the data in the logs, compliance controls can be applied in two processes: the first is log ingestion (or log generation), and the second is log access. Compliance controls for log access are straightforward. They usually focus on preventive and detective controls and can be implemented using identity and access management, audit logs, and the infrastructure that persists the log data. These compliance controls for log access are usually universal across different regulations and standards. Compliance controls for log ingestion are different in that they usually involve detective and corrective controls, which can be specific to the industry in which the application is used as well as to the geographic location where the application is deployed.
The diagram that you see roughly depicts the process of log ingestion and how the detective and corrective controls can be applied. As you can see, there are two types of corrective controls that can be implemented in the log ingestion flow. The first type is removing the data that violates compliance from the logs, or dropping the whole log entry. The second type is partially or fully obfuscating the data that violates compliance.
Implementing obfuscation logic requires special knowledge, because sometimes it is not enough just to de-identify the data in the logs so that the original information cannot be restored; sometimes certain properties or qualities of this data have to be preserved for future use. As an example, think about the troubleshooting tasks that engineers have to perform, using the logs to reconstruct the original transaction logic that the application ran. Suppose, for example, that the application logs store user identity and use the Social Security number of the user as the primary user identity. And yes, I know it is wrong, and unfortunately many applications still do this. It is not enough just to obfuscate the part of the log that stores the SSN, because then it would be impossible to follow the logic of the application flow. It is also important that the obfuscated value remains unique across all other logs and stays the same for the same user. So this kind of implementation can sometimes be challenging when developers implement corrective controls for their logs.
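As a minimal sketch of that idea (an illustration only, with a made-up key and function name), a keyed hash can obfuscate the SSN so the original value cannot be restored without the secret, while the same user still maps to the same token in every log entry:

```python
import hmac
import hashlib

# Hypothetical secret; in practice it would come from a secret manager,
# not from source code.
PSEUDONYMIZATION_KEY = b"replace-with-a-managed-secret"

def pseudonymize_ssn(ssn: str) -> str:
    """Replace an SSN with a deterministic, non-reversible token.

    The same SSN always produces the same token, so log entries for one
    user can still be correlated, but the original value cannot be
    restored without the secret key.
    """
    digest = hmac.new(PSEUDONYMIZATION_KEY, ssn.encode("utf-8"), hashlib.sha256)
    return "user-" + digest.hexdigest()[:16]

# Both calls produce the same stable identity after obfuscation.
print(pseudonymize_ssn("123-45-6789"))
print(pseudonymize_ssn("123-45-6789"))
```

If the obfuscated value also has to look like a valid SSN, a format-preserving encryption or tokenization scheme is typically used instead of a plain hash.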
Now I would like to review basic implementations of logs anonymization. I will do it based on a simplified architecture that captures the main challenging elements that you might see when implementing logs anonymization.
The architecture includes a two-tier application, with the front-end application running on user devices and the back-end service running behind a firewall. On the diagram you can see blue arrows, which capture the business communication and transactions of the application, and gray arrows, which capture the log ingestion flows. Please pay attention that this includes logs captured from the infrastructure, represented in this diagram by the load balancer, as well as the application logs.
The first solution is the engineering-oriented solution. This is what R&D teams usually do when tasked with logs anonymization: they implement the detective and corrective controls as close as possible to the place where the log entries are generated. Clearly, the ownership of the solution is on the R&D teams, which raises a potential problem for maintaining the application (this is where DevOps originally was born from), so the follow-up work for maintaining it can be very expensive. As mentioned, implementing it requires specialized knowledge that R&D teams might not have.
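To make that concrete, here is a minimal, hypothetical sketch of such an in-application control in Python: a logging filter that acts as the detective control (the SSN pattern) and the corrective control (masking) right where the entry is produced. The logger name and regex are assumptions for the example.

```python
import logging
import re

# Detective control: a pattern that identifies SSN-like values in log messages.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class SsnRedactingFilter(logging.Filter):
    """Corrective control applied where the log entry is generated."""

    def filter(self, record: logging.LogRecord) -> bool:
        # A real implementation would also handle record.args and other fields.
        record.msg = SSN_PATTERN.sub("[REDACTED-SSN]", str(record.msg))
        return True  # keep the entry, but with the sensitive part obfuscated

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
logger.addFilter(SsnRedactingFilter())

logger.info("payment declined for user 123-45-6789")
```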
Additionally, this kind of implementation runs the solution as part of the business application process. As such, it consumes the same resources that were originally allocated for the application itself and reduces the application's performance. It also means that if any problem is identified within the implementation of the compliance controls, or some modification is required due to changes in the regulations, it will require rolling out a new version of the application, even if nothing has changed from the business perspective. Deploying such a solution in multiple geographic locations can be very challenging, because it will require either a complex, and probably again poorly maintainable, configuration solution, or deploying the solution with more than one implementation of the detective or corrective controls so that it can run correctly in more than one geographic location. And last but not least, you can see that this implementation cannot handle logs ingested by the infrastructure itself.
So there is another solution, often seen in practice, that tries to mitigate some of these drawbacks. This solution basically delegates the actual work to a third-party or standalone implementation. Usually this kind of implementation is referred to as a data loss prevention (DLP) service. It can be found as a commercial solution or can be developed in house, with the same drawback of requiring specialized knowledge; it depends on each particular organization. In this solution, R&D teams share the responsibility with DevOps, because rolling out and maintaining the DLP service falls on the DevOps team, while some of the implementation changes remain the R&D team's responsibility. For example, to work properly, a DLP solution should be aware of the source of the logs. If the application does not work with structured logs, or it works with structured logs but does not provide enough metadata about the log source, such as what kind of service generated the log, where the service is located, what the specifics of the environment are, and so on, this type of data has to be added, and the R&D team will be responsible for implementing this change. Additionally, all the logic for ingesting logs on the application side has to be modified to redirect logs to the DLP service. In many cases, R&D teams will have to do this work. These kinds of changes can be partially mitigated by implementing some logging agent, but that is not always possible.
This solution has many advantages compared to the previous one. First, the maintenance is usually done by DevOps, so there is a team that specializes in maintenance, and it can partially reduce the maintenance cost. Second, it can remove the need for specialized knowledge when a third-party solution is used, which also simplifies the compliance validation process and helps guarantee that compliance requirements for logs anonymization are implemented correctly. Additionally, it runs in its own environment; as a result, it does not consume resources originally intended to run business logic. Being a standalone service, it can be rolled out separately, so any changes in configuration or implementation don't influence the application. Clearly it is a better fit for scaling and for running in multiple geographic locations. However, it still has a few drawbacks. It is still application focused, so infrastructure logs still go unattended. It still carries a relatively high maintenance cost, even though the cost is shifted toward DevOps, and, as I mentioned, it may still require changes in the application code as a result of working with the DLP service.
So today, when a lot of applications run in the cloud, and when speaking about the maintenance cost, one of the possible solutions is to leverage the cloud. We can modify the previous implementation by shifting all work relevant to the logs into the cloud. This provides us with a couple of advantages. First, it allows us to almost completely remove R&D team involvement, meaning no need for additional work from the R&D team; the DevOps team gets full control over the solution from the beginning. The solution implements redirection of the logs using a proxy that fronts the log management backend, sending all ingested logs into the DLP service first and then from the DLP service to the log management backend. Usually, when the rest of the application already runs in the cloud, in the majority of cloud hosting environments the logs already get enriched with additional metadata and converted to some kind of structured logs, so no additional modifications in the application code are required.
And for hosts that don't do it, it is possible to use log processors such as Fluentd or Fluent Bit, which can reformat the logs, especially if the logs are printed to standard output, or which can serve a local endpoint for log ingestion and then forward the logs with additional information to the back end.
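For example, here is a minimal sketch of an application emitting one JSON object per log line to standard output, with illustrative source metadata that an agent such as Fluent Bit could parse, enrich, and forward (the field names and values are assumptions, not a required schema):

```python
import json
import logging
import sys

class JsonLineFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line on stdout."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # Illustrative source metadata a downstream DLP step can rely on.
            "service": "checkout-backend",
            "region": "us-central1",
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("order created for user 123-45-6789")
```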
The same can be done with OpenTelemetry, but today I don't want to touch upon that. This solution lets you implement logs anonymization without having specialized knowledge about data identification or obfuscation. You can do it in a scalable and flexible way, without changing a single line of code, keeping all the work within the DevOps team, which already specializes in maintenance. On the other side, the additional maintenance cost can be significantly reduced if you host this solution in the cloud, which already provides you with partial maintenance. It is especially easy when you use one of the major cloud providers, which have managed solutions for this type of service.
I would like to show you a reference architecture that will help you implement this solution in Google Cloud. This architecture is fully hosted in Google Cloud. It includes four different managed services: Cloud Logging; Pub/Sub, which is a managed asynchronous messaging service; Dataflow, which is a managed ETL pipeline service based on Apache Beam; and Cloud DLP, the service that provides implementations of the detective and corrective controls out of the box.
out of the box logging
service in Google Cloud supports not only
storing the logs, it also supports
log routing. So you
will not require to implement this proxy
by yourself. You will have only to configure the proper
routing logic in your
environment.
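As a sketch of that configuration with the Cloud Logging Python client (the project, sink name, topic, and filter below are placeholders), a log sink can route matching entries to a Pub/Sub topic that feeds the pipeline:

```python
from google.cloud import logging as cloud_logging

# Placeholder project, sink name, topic, and filter; adjust to your environment.
client = cloud_logging.Client(project="my-project")
sink = client.sink(
    "route-logs-to-dlp",
    filter_='resource.type="http_load_balancer" OR resource.type="gae_app"',
    destination="pubsub.googleapis.com/projects/my-project/topics/logs-for-dlp",
)

if not sink.exists():
    # The sink's writer identity still needs permission to publish to the topic.
    sink.create()
```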
Additionally, in Google Cloud, structured logs with all the necessary metadata about the source of the logs are provided out of the box, and Google Cloud provides you with a logging agent that can be installed in your environment and will enrich your logs with the relevant information.
What is most interesting about this flow is the implementation of the compliance controls, so let me walk you through the ETL pipeline implemented in the Dataflow service, which is the managed version of Apache Beam. For those of you who are not familiar with Apache Beam: it is an extract, transform and load (ETL) pipeline framework that allows you to define multiple transformations for the data that gets ingested into the pipeline, and at the end it exports this data to a predefined destination. This can be done on a batch of data, which is a kind of finite set of input data, or it can be done on a stream.
for the stream. So in our case,
all logs entries just get streamed
into this pipeline using the logging
and service which routes all logs
from the application and from the infrastructure into the pipeline.
The first step will be to transform it from the pub sub
message and retrieve the relevant
data of the log entry
and then the second step will be to aggregate this
data into big batches and
then these batches are sent to the DLP.
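A rough skeleton of such a streaming pipeline with the Apache Beam Python SDK might look like the following; the subscription name, batch sizes, and the DLP step are placeholders, and the reference implementation linked at the end of the talk is the authoritative version:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def extract_payload(message: bytes) -> dict:
    """Turn a Pub/Sub message into the relevant part of the log entry."""
    return json.loads(message.decode("utf-8"))

def deidentify_batch(batch):
    """Placeholder for the call to the DLP API with a batch of entries."""
    # The real implementation sends `batch` to DLP and returns the
    # de-identified entries (see the DLP example further below).
    return batch

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadLogs" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/logs-for-dlp")
        | "ParseLogEntry" >> beam.Map(extract_payload)
        | "BatchEntries" >> beam.BatchElements(min_batch_size=100,
                                               max_batch_size=500)
        | "DeidentifyWithDlp" >> beam.Map(deidentify_batch)
        | "FlattenBatches" >> beam.FlatMap(lambda batch: batch)
        # Final step: write the entries back to the log storage destination.
    )
```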
The purpose of aggregating multiple entries into batches is to save both cost and performance, because all cloud providers, including Google Cloud, introduce some kind of quota on total API throughput at large volumes. Additionally, most cloud providers charge you per API call, so sending multiple entries in a single call can save you significantly on your monthly cloud bill. It also allows you to scale very well with log-intensive applications.
The DLP API allows you to define configuration for both the detective and corrective controls, and for each entry that the detective control identifies as matching a condition, the corrective control logic can be applied.
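As a hedged sketch with the Cloud DLP Python client: the inspect configuration below is the detective control (which info types to look for) and the de-identify configuration is the corrective control (how each match is transformed). The project ID, info types, and masking choice are placeholders.

```python
from google.cloud import dlp_v2

def deidentify_text(project_id: str, text: str) -> str:
    """Detect the configured info types and mask them in the given text."""
    client = dlp_v2.DlpServiceClient()

    # Detective control: what to look for.
    inspect_config = {
        "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"},
                       {"name": "EMAIL_ADDRESS"}],
    }
    # Corrective control: what to do with every match.
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [{
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

print(deidentify_text("my-project", "user 123-45-6789 asked to reset password"))
```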
Eventually the processed batch is returned from the DLP service, formatted back into the log entry format, and ingested into the log storage bucket.
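Assuming the de-identified entries come back as JSON-like payloads, the write-back step could look roughly like this with the Cloud Logging Python client; the log name and labels are placeholders, and the linked reference implementation may do this step differently:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")
logger = client.logger("anonymized-application-logs")

def write_processed_entries(entries):
    """Ingest de-identified entries back into Cloud Logging."""
    for entry in entries:
        # `entry` is assumed to be a dict-shaped, already de-identified payload.
        logger.log_struct(entry, labels={"pipeline": "dlp-anonymization"})
```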
This flow is already implemented: you can find the implementation on the GitHub repo; please find the link in the slides. And if you want a more in-depth understanding of this reference architecture and its implementation, you can read the blog post I published on Medium. If you have additional questions about this topic, or about any other topic in observability, I will be glad to chat with you on Discord. Thank you for being with me.