Transcript
This transcript was autogenerated. To make changes, submit a PR.
I am Antonio and today with Francesco we are going to present
GDPR and beyond demystifying data governance
challenges. Francesco and I are data architects
at Agilelab, an Italian consulting firm specializing in
large-scale data management. Agilelab is
an effective and dynamic company structured around a holacracy-inspired
model with multiple business units.
Through these business units, we are lucky enough to have several Fortune
500 companies as our customers. Ok,
let's take a look at what we have on the agenda. Today we will
talk about data privacy and GDPR,
why this European regulation is so important and why we
should design systems to be compliant with it.
We will have an overview of different techniques that we can leverage
to be compliant and secure like anonymization and encryption.
Then we will compare these different techniques focusing
our attention on pros and cons of each one.
Finally, we will present a viable data sharing strategy
for real use cases. Data is the new oil,
am I right? Data is the fuel
for innovation. Machine learning, artificial intelligence and
analytics are simply not possible without data.
Just as oil powers engines, data fuels algorithms,
enabling machines to learn and improve over time;
this is the case for machine learning. Data is also
the lifeblood of AI, driving smart ecosystems
that can mimic human intelligence.
Finally, data analytics extracts valuable
insights just as refining oil produces useful products.
Yeah, data is the new oil, and it is the
new oil also in the bad parts.
For example, data breaches are very similar to oil spills.
They cause extensive damage and leak sensitive information
that erodes the trust of our customers.
We can also have privacy violations which are very similar
to the pollution that can harm ecosystems.
Privacy violations disrupt the digital environment and harm
individuals. Then we have regulatory fines, which
are very comparable to environmental fines,
which means that when your data handling is not compliant
with data protection regulations, you will get
very large fines. Finally, you will suffer
reputational damage because, like environmental
damage, data breaches can severely
impact brand trust and loyalty. So let's
reflect on the reality of data breaches over the past 20
years. This visualization showcases the top 50
biggest data breaches. From 2004 to 2020,
117 billion records
were compromised. As we can see,
the severity of breaches has escalated over the years,
particularly from 2016 onwards, with the web
sector being the hardest hit, accounting for nearly 10
billion records lost. Significant breaches span
across various sectors including finance, government and
tech, highlighting the widespread vulnerability.
Notable breaches include Yahoo in 2013, losing
3 billion records and Facebook in 2019 with
530 million records exposed.
This growing trend underscores the critical need for robust data
security measures. These breaches not only
compromise personal information, but also erode
public trust and pose severe financial risks.
As we move forward, it is imperative to prioritize
data protection and adopt stringent security protocols
to safeguard these digital assets.
That's why the European Union came up with GDPR.
GDPR stands for General Data Protection Regulation
and is a regulation that requires businesses around the world
to protect the personal data and privacy of European
Union citizens. In effect since
25 May 2018, GDPR puts
in place certain restrictions on the collection, use and retention
of personal data. Personal data is defined
as information relating to an identified or identifiable
natural person. This includes data such as name,
email and phone number, in addition to data that may be
less obvious, like IP addresses, GPS location, phone IDs
and more. GDPR is based on some
key principles, and we will briefly
run through them. First of all is lawfulness. Personal data
must be processed legally, adhering to established laws such
as GDPR. Then we have fairness.
Data processing should be fair: protecting vital
interests, performing tasks carried out in the public interest,
or pursuing the legitimate interests of the data controller
or a third party. Then we have transparency, because
organizations must be open about their data processing activities.
They should provide clear, accessible and understandable information
to individuals about how their data is being used,
who is collecting it and why. And this is
why you now have the cookie banner on every European
website. Then we have purpose limitation.
So personal data must be collected for specified,
explicit and legitimate purposes and not further
processed in a manner that is incompatible with those purposes.
Then you have the data minimization principle. That means
organizations should collect only the personal data that is
necessary to achieve the specified purpose, and we will have
more on this later. Then we have accuracy. So personal
data must be accurate and kept up to date,
and inaccurate data should be corrected or deleted.
Storage limitation means that personal data should not be kept longer than
necessary for the purpose for which it was collected.
An example of storage limitation is the right to be forgotten:
if I ask a company to be forgotten, they should
delete all data about me.
Then we have integrity and confidentiality.
So organizations must ensure the security
of personal data, protecting it against unauthorized or
unlawful processing and against accidental loss, destruction or damage,
like data breaches. And then you have accountability. So data
controllers are responsible for complying with GDPR principles
and must be able to demonstrate their compliance.
In order to do that, GDPR creates some requirements around
the regulation itself. It requires that companies
have data protection impact assessments, which are tools used
to identify and mitigate risks associated with data processing
activities. They must follow data
breach notification rules, so they
should report data breaches in a timely manner to both authorities
and individuals. Then they need to
appoint a data protection officer, someone inside
the company responsible for overseeing the data protection strategy
and compliance with GDPR. Obviously,
they need to implement data protection by design and by default.
So every data initiative in the company should
comply with this regulation from the start, without any
need to retrofit compliance afterwards. And then they
have some record keeping obligations: companies need to
maintain a detailed record of their data
processing activities, for accountability and
compliance purposes. Obviously, GDPR had
huge implications for data governance. But what is data governance?
Data governance is the process of managing the availability,
usability, integrity and security of the data in enterprise
systems, based on internal standards and policies that also control
data usage. Effective data governance
ensures that data is consistent and trustworthy
and doesn't get misused. GDPR had some implications
on internal data governance strategies for companies or enterprises,
such as: they had to enhance data security and privacy controls
to be compliant; they needed to improve data quality
and accuracy because of the principles we've seen before;
and they had to increase accountability and
transparency in how data was used.
And obviously, GDPR put pressure on companies and
created the necessity for regular audits and assessments around
data. Today, we will focus mostly
on the data minimization principle, which in our opinion
is one of the most important ones in GDPR.
The data minimization principle is foundational to responsible data
handling and privacy protection. Under GDPR,
this principle mandates that organizations should
only collect and process the personal data that is
absolutely necessary for their specified purposes.
Imagine you're building a house. You wouldn't order extra
bricks that you'll never use as it would be wasteful and
clutter your space. Similarly, in data processing,
we should avoid collecting excess data. By adhering
to this principle, we not only streamline
our data management practices, but also enhance security
and compliance. Collecting minimal data
reduces the risk of breaches and misuse,
ensuring we respect our customers' privacy and build their
trust. This also simplifies data management and
can lead to more efficient processes. So let's commit
to collecting only what we need, protecting privacy
and fostering a culture of data responsibility.
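As an illustration of what data minimization can look like in practice (our own sketch, not part of the talk's slides; the field names are hypothetical), here is a minimal Python example that keeps only an explicit allow-list of fields at ingestion time:

```python
# Hypothetical example: keep only the fields we actually need
# for the stated purpose (order fulfilment); drop everything else.
ALLOWED_FIELDS = {"order_id", "product_id", "quantity", "shipping_country"}

def minimize(record: dict) -> dict:
    """Return a copy of the record containing only allow-listed fields."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "order_id": 42,
    "product_id": "SKU-123",
    "quantity": 2,
    "shipping_country": "IT",
    "email": "mark@example.com",   # not needed for this purpose
    "gps_location": "45.46,9.19",  # not needed for this purpose
}
print(minimize(raw))  # excess personal data never enters the pipeline
```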
So, does this meme look familiar?
We know that people working on AI, machine learning and
analytics need real data to do their job,
but this clashes with GDPR regulation 99% of the
time. I will now leave the stage to Francesco,
who will show you how we can build a compliant data sharing strategy
and still allow data practitioners to be effective.
Here we go, Francesco. This meme should look familiar.
This is a quite common scenario, since most ML
engineers and data scientists need to prototype
their models every day. In the majority of cases they
start using development data, but the risk is that, when
moving to production, the performance of the model is low,
so they fall back to using sampled data from production.
This also increases the risk of sensitive data leakage
and exposure in lower (non-production) environments. In the next slide,
let's see how we can untangle this issue and which techniques
could help our case. The first thing we are going to
talk about is anonymization.
So data anonymization is one of the techniques that organizations
can use in order to adapt to strict data privacy regulations
that require the security of personally identifiable information (PII), such
as health reports, contact information and
financial details. It differs from pseudonymization
since it is not a reversible operation.
Pseudonymization simply reduces the correlation of
a data set with the original identity of a data subject
and is therefore a useful but not an absolute security
measure. Let's now take a look at the most
common techniques under the umbrella of anonymization.
The first one that we are going to talk about is generalization.
Generalization usually changes the scale of a data set's
attributes, or their order of magnitude. As you may see,
we have a simple table with several columns like name,
age, birth date, state, and disease.
In example one, a field that contains numbers, like
age, can be generalized by expressing it as an interval.
As you may see, Mark's age has been
put within an interval between 20 and 30.
In example two, a field containing dates like
1993-10-19 can be
generalized by using only the year, 1993.
This is the very first method. The other one is
randomization. Randomization involves changing attributes
in a dataset so that they are less precise while
maintaining their overall distribution. Under the umbrella
of randomization, we have techniques such as noise
addition and shuffling. Noise addition
injects small modifications into the data
set in order to make it less accurate, for example
increasing or decreasing the age of a person,
as we can see in the example on the slide,
while shuffling simply swaps the ages of Mark and
John. This is the second method; a small sketch of both follows.
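Here is how generalization, noise addition and shuffling might look in pandas (our own illustration, mirroring the slide's example; the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Mark", "John"],
    "age": [27, 45],
    "birth_date": pd.to_datetime(["1993-10-19", "1975-02-03"]),
})

# Generalization: replace the exact age with a 10-year interval,
# and the full birth date with just the year.
low = (df["age"] // 10) * 10
df["age_range"] = low.astype(str) + "-" + (low + 10).astype(str)
df["birth_year"] = df["birth_date"].dt.year

# Randomization / noise addition: perturb the age by a small random
# amount while keeping the overall distribution roughly intact.
rng = np.random.default_rng(seed=42)
df["age_noisy"] = df["age"] + rng.integers(-3, 4, size=len(df))

# Randomization / shuffling: swap ages across rows so they no longer
# line up with the original individuals.
df["age_shuffled"] = rng.permutation(df["age"].to_numpy())

print(df[["name", "age", "age_range", "birth_year", "age_noisy", "age_shuffled"]])
```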
The third method is the most common one:
suppression, another useful technique and, in my opinion, the most
used in the space of anonymization.
Suppression is the process of removing an attribute's value
entirely from a data set, while redaction removes
part of the attribute's value from a data set.
With such techniques you can have multiple issues.
For example, warning number one: if
the data is collected for the purpose of determining
at which age individuals are most likely to develop a specific
illness, suppressing the age data
would make the data itself useless. Warning number
two: the data type may change, for example from integer
to string, and this will break the contract for all the
consumers of that data asset.
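A hypothetical sketch of both techniques, which also shows warning number two in action (the suppressed age column silently becomes a string):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mark", "John"],
    "age": [27, 45],
    "ssn": ["123-45-6789", "987-65-4321"],
})

# Suppression: remove the attribute's value entirely.
# Note: the column is now a string, no longer an integer,
# which breaks the contract with downstream consumers.
df["age_suppressed"] = "*"

# Redaction: remove only part of the value, keeping the last four digits.
df["ssn_redacted"] = df["ssn"].str[-4:].radd("***-**-")

print(df[["name", "age_suppressed", "ssn_redacted"]])
```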
In this slide, as you can see, we have a brief comparison
of the methods explained previously. For each strategy
we evaluated three main factors: secrecy,
privacy and utility. For each
strategy and factor we assigned a rating ranging
from poor to best. As you may see, every method has
its weaknesses, so there is no evidence of
a superior technique that could address all the
factors at once, in this case secrecy,
privacy and utility. What we can say is that it
depends a lot on the use case,
but now let's take a look at the encryption methods in
order to understand if they could help in the context of
compliance. The first
method we are going to talk about is format preserving encryption.
Format-preserving encryption, or FPE, is
a symmetric encryption algorithm which preserves
the format of the information while it is
being encrypted. FPE is weaker than
the Advanced Encryption Standard (AES). Format-preserving
encryption can preserve the length of the data as
well as its format. FPE is standardized by NIST, and
there are three different modes of operation: FF1, FF2
and FF3. FPE works very
well with existing applications as well as new applications.
If an application needs data of a certain length and
format, then FPE is the way to go.
In order to operate with this algorithm,
you should use a secret key and a tweak. One implementation
is provided by Bouncy Castle, another one is
available on Google Cloud, and other toolkits exist as well. A minimal sketch follows.
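As a rough illustration (our own example, not a library the speakers named), here is format-preserving encryption with pyffx, an open-source Python implementation of the FFX scheme; note that pyffx is a simplified FFX variant, not the NIST-approved FF1:

```python
import pyffx  # pip install pyffx

key = b"secret-key"

# Encrypt a 6-digit number: the ciphertext is still a 6-digit number,
# so schemas and validation rules keep working.
enc_int = pyffx.Integer(key, length=6)
ct = enc_int.encrypt(123456)
assert enc_int.decrypt(ct) == 123456

# Encrypt a lowercase string while preserving its length and alphabet.
enc_str = pyffx.String(key, alphabet="abcdefghijklmnopqrstuvwxyz", length=4)
print(enc_str.encrypt("mark"), "->", enc_str.decrypt(enc_str.encrypt("mark")))
```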
Now that we have seen the first encryption method,
let's take a look at another one.
Homomorphic encryption provides the
ability to compute on data while the data is encrypted.
It sounds like magic, doesn't it? There are three different
modes. Partially homomorphic encryption
allows a single mathematical function to be used,
for example addition or multiplication.
With somewhat homomorphic encryption, some functions can be
performed only a fixed number of times or
up to a certain level of complexity.
And in the end we also have fully homomorphic encryption, which
allows all mathematical functions to be performed
an unlimited number of times, up to any level of complexity,
without requiring the decryption of the data.
So suppose you want to process some sensitive data
in the cloud. In the picture, on the left, you
can see the traditional approach.
In this example, you encrypt the files before moving
them to the cloud, for example with a standard algorithm
like AES. Then, if you want to perform some transformation
on these files, you have to decrypt them, apply the transformation and
then encrypt again. This exposes the data
to risk and also introduces operational complexity. On
the right side, instead, you use homomorphic encryption.
Once you are on the cloud, you can do computations directly on the
ciphertext; when you decrypt, you obtain the same result as
applying the function to the plaintext data.
Unfortunately, it requires significant computational
resources to perform these intensive calculations,
making this kind of strategy very slow and very
resource intensive. In addition to performance
concerns, implementing this specific kind of algorithm
can be very challenging, with highly complex techniques.
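To make the "compute on encrypted data" idea tangible, here is a minimal sketch of partially homomorphic (additive) encryption using the open-source python-paillier (phe) package; this is our own illustration, not a library the speakers mentioned:

```python
from phe import paillier  # pip install phe

# Generate a keypair: the cloud only ever sees the public key
# and the ciphertexts, never the plaintext values.
public_key, private_key = paillier.generate_paillier_keypair()

salary_a = public_key.encrypt(42_000)
salary_b = public_key.encrypt(58_000)

# Addition happens directly on the ciphertexts (Paillier is
# additively homomorphic), with no decryption in between.
encrypted_total = salary_a + salary_b

# Only the key owner can decrypt the result.
assert private_key.decrypt(encrypted_total) == 100_000
```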
Is that all? No, we also have other strategies and
methods on the table. One of these
is tokenization. Tokenization involves substituting
sensitive data, like credit card numbers, with non-sensitive
tokens, while the original values are stored securely
in a separate database called a token vault.
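A toy sketch of the idea (hypothetical, with the vault reduced to an in-memory dict; a real token vault would be a separate, hardened, access-controlled store):

```python
import secrets

# Token vault: maps opaque tokens back to the original sensitive values.
vault: dict[str, str] = {}

def tokenize(card_number: str) -> str:
    """Replace a card number with a random, non-reversible token."""
    token = "tok_" + secrets.token_hex(8)
    vault[token] = card_number
    return token

def detokenize(token: str) -> str:
    """Only systems with vault access can recover the original value."""
    return vault[token]

t = tokenize("4111 1111 1111 1111")
print(t)              # e.g. tok_9f2c4e...; safe to store and share
print(detokenize(t))  # original value, recoverable only via the vault
```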
Synthetic data is artificial data generated
with the purpose of preserving privacy, testing systems,
or creating training data for machine learning algorithms.
Synthetic data generation is critical and very
complicated for two main reasons: quality and
secrecy. Synthetic data that can be reverse engineered
to identify real data would not be useful in
a privacy context. Faker is a Python
package that generates fake data for you. There is also Mockaroo,
another tool in the mock-data ecosystem,
which allows you to quickly and easily
download large amounts of randomly generated test data
based on the specifications that you define.
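For instance, a minimal Faker sketch (our own example) that produces realistic-looking but entirely fake customer records:

```python
from faker import Faker  # pip install Faker

fake = Faker()
Faker.seed(0)  # reproducible output for tests

# Generate fake customer records: realistic shape, zero real PII.
customers = [
    {"name": fake.name(), "email": fake.email(), "address": fake.address()}
    for _ in range(3)
]
for c in customers:
    print(c)
```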
Now let's take a look and summarize what we have learned
in the previous slides. The first option for sharing data
for machine learning and analysis in lower environments
is to use sampled data coming from the production environment.
The first pro is realism: sampled production
data provides a realistic representation of actual
data, helping developers and testers to identify
issues that may not be evident with synthetic
or mock data. You have improved testing, since using real
data allows for more comprehensive and accurate
testing of functionality, data integrity,
performance and scalability. Then you have stakeholder
confidence: using real data increases stakeholder confidence
in the testing process and the reliability of the development
lifecycle. Within the cons, you have
a lot of issues with privacy and
compliance: even sampled data
can contain sensitive information, raising privacy
concerns and potential non-compliance with data
protection regulations such as GDPR and
others like HIPAA.
Security risk: using production data in lower
environments like development or QA increases
the risk of data breaches and unauthorized access.
Data freshness is another issue: sampled data might become
outdated quickly, leading to scenarios where dev
environments are not completely aligned with
the current production environment. Let's take a look at
synthetic data. Privacy and security for sure
are pros: synthetic data can be generated
without any real-world personal data, significantly reducing
privacy concerns and the risk of data leakage.
Availability: synthetic data can be created on demand.
Efficiency: generating synthetic data can be more
cost effective than collecting and labeling a large
volume of real-world data. On the contrary,
we have a lack of realism: synthetic data may not capture all the
complexities and nuances of real-world data,
and we can have problems with overfitting. Also,
there will be validation challenges: validating the accuracy
and reliability of synthetic data can be very challenging,
as it requires ensuring that the synthetic data closely
mimics real-world data distributions. There is also
a concern about complexity:
creating high-quality synthetic
data requires sophisticated techniques and domain knowledge,
making the initial setup complex and resource intensive.
Encrypted data sources: let's talk about them, focusing our
attention on standard algorithms. Privacy protection:
anonymizing data with encryption reduces the risk of
exposing personal information. Regulatory compliance:
using anonymized data helps organizations comply with
GDPR, CCPA and so on. We can easily
enable a data sharing mechanism and
share this data with a bit
more freedom within departments, across the
organization, or even with external partners.
We also have risk mitigation, because we
reduce the potential for data breaches, since
the data no longer contains personally identifiable
information. But on the contrary, we have some
complexity. Working with encrypted data,
especially in the case of the AES algorithm,
can complicate the development phase and also
testing activities, because the data loses any kind
of meaning, keeping only its distribution.
Key management challenges: effective key management
is crucial and can be complex,
especially in non-production environments where multiple
teams and individuals may need access to encryption keys.
Limited testing accuracy: testing with encrypted data
may not reflect true application behavior if the decryption process introduces
delays or errors that
wouldn't occur in production. Anonymized data: as
we have seen before, the anonymization
process can be complex and challenging,
requiring sophisticated techniques
and ongoing management to ensure data remains anonymous.
There is also the problem of re-identification
of the data subject: there is a risk that anonymized data
can be reverted, especially if combined with
other data sets, for example through linkage attacks
and so forth. In some cases we lose utility,
as we have seen for suppression. But now that
we have summarized all the possible methods and
techniques, at least the most important ones within the
compliance context, let's take a look at the
next slide. We are going to present a possible strategy
for sharing data in a quite secure way in lower environments.
The practice I'm going to show you will combine some of the methods
that we have seen before in the context of a data lake.
So before moving forward, let's have a little bit
of context. We are in the cloud and we have a data lake.
In this specific case, we leverage the medallion
architecture for our storage layer. Most of you already
know what a medallion architecture is: it is also known
as a multi-hop architecture. Data at each
stage gets richer, increasing its intrinsic value.
At each stage, PII can be present, and usually machine
learning engineers and data scientists operate at the silver
and gold layers. In this slide, let's see what
we can do to ship data to lower environments and enable
safe data consumption. This is a recipe for a
cloud-based scenario, for example AWS, but it can be
easily replicated on other cloud vendors.
So you have a production account at the top and,
for simplicity, a non-production account at
the bottom. In each account we have the usual
medallion architecture that we have seen before.
Step one requires that data teams
are in charge of anonymizing data
as part of their jobs. The encryption process becomes a mandatory step
in the data lifecycle, made of data ingestion,
data normalization and delivery.
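As a sketch of what step one might look like (our own hypothetical example, reusing pyffx from earlier): sensitive columns are format-preserving-encrypted during ingestion, before the data ever lands in the bronze layer:

```python
import pandas as pd
import pyffx

KEY = b"prod-only-key"  # held in the production account's KMS, never shared

def encrypt_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Encrypt sensitive columns before writing to the bronze layer."""
    enc = pyffx.String(KEY, alphabet="0123456789", length=10)
    out = df.copy()
    # Phone numbers stay 10-digit strings, so downstream schemas still hold.
    out["phone"] = out["phone"].map(enc.encrypt)
    return out

raw = pd.DataFrame({"customer_id": [1, 2],
                    "phone": ["3331234567", "3457654321"]})
bronze = encrypt_pii(raw)  # this is what lands in the bronze layer
print(bronze)
```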
In step two, we open a read-only cross-account policy
from production to non-production. The lower environment
never writes to prod; it is only enabled for read operations.
The encryption key is never shared with lower
environments. This way, the user personas that
work in the lower environment are enabled to do their job.
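On AWS, the read-only cross-account grant could be sketched as an S3 bucket policy like the following (the account ID and bucket names are hypothetical; a real setup would also scope KMS permissions so the encryption key stays in prod):

```python
import json

# Hypothetical S3 bucket policy on the production data lake bucket:
# the non-prod account (222222222222) may only list and read objects.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "NonProdReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::prod-datalake-bronze",
                "arn:aws:s3:::prod-datalake-bronze/*",
            ],
        }
    ],
}
print(json.dumps(read_only_policy, indent=2))
```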
Data engineers can prototype new silver datasets reading
from the encrypted bronze layer. Analysts can model
new schemas and generate new anonymized reports.
Data scientists can prototype their models on quite
realistic data sets, since only the sensitive
columns are encrypted. Let's take a look at
the benefits of this kind of practice and strategy.
Analysts say: format-preserving encryption
guarantees referential integrity and no schema
changes across different data sets, and it allows
business logic to be reused; joins still
work after the encryption. Data
engineers are allowed to read only the encrypted layers, governed
by an ad hoc IAM policy. This is going to simplify
data movement and the orchestration process between environments.
Machine learning engineers can prototype and train their
models on quasi-real data in
a safe layer. DevOps practices are still in
place, since deployments of new artifacts and models
can follow the standard CI/CD flow.
SecOps says: the minimization principle is respected
in lower environments, since most of the time you have encrypted
information. Now let's take a look at
a bonus feature and talk about
the right to be forgotten within the ecosystem of
GDPR. In this slide we are going to present
the crypto-shredding technique. Crypto-shredding
is the practice of deleting data by deleting
or overwriting the encryption keys. This
requires that the data has been encrypted beforehand.
Deleting the key automatically, logically deletes the record
and all its existing copies, since the
encrypted information is not reversible anymore.
This approach is very useful when you have multiple copies
of data, or multiple layers
of data like in the medallion architecture.
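A toy sketch of the idea (our own, with per-user keys kept in an in-memory dict standing in for a real KMS): once a user's key is destroyed, every encrypted copy of their records, in every layer, becomes unreadable:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Per-data-subject keys; in production this would be a KMS, not a dict.
keys = {"user-42": Fernet.generate_key()}

def encrypt_record(user: str, payload: bytes) -> bytes:
    return Fernet(keys[user]).encrypt(payload)

# The same ciphertext may live in bronze, silver, gold and in backups.
copies = [encrypt_record("user-42", b"mark,1993-10-19,rome") for _ in range(3)]

# Right to be forgotten: shred the key instead of hunting down every copy.
del keys["user-42"]

# All copies are now logically deleted: without the key,
# none of them can ever be decrypted again.
```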
If you are in the early stages of creating your data lake
and building the foundation, you can combine crypto-shredding
and format-preserving encryption in order to
enable a very interesting scenario that covers both
the secure data sharing practice we explained before
and the deletion problem across multiple layers and
environments. It is needless to say that
all these techniques and strategies only work with a strong
data governance practice in place: knowing
where PII is stored, and its lineage,
is fundamental. But this is another story.
Thank you everybody, and let's get in touch for any questions
and answers.