Conf42 Platform Engineering 2024 - Online

- premiere 5PM GMT

Building Privacy Aware Infrastructure

Abstract

Concepts of functional programming and performance optimization are my specialty when addressing challenges associated with big data. Let’s connect to explore how I can help your organization build efficient data pipelines that drive business value.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Natalia, and today we're going to be talking about privacy, in particular how to build infrastructure, products, and perhaps some company processes with privacy in mind.

A couple of words about me. I'm a software engineer with over 10 years of overall experience, the last three of which I've spent working in privacy in big tech. What I've noticed is that privacy is a highly ambiguous and extensive domain with lots of terms and abbreviations, and people get lost in it. Therefore, I'm hoping to bring a little bit of clarity to the topic and give a high-level overview of the scope and its relation to infrastructure and platform engineering, hopefully in a way that doesn't make you fall asleep.

Now, a little bit about the motivation. Why do we want to care about privacy? First, we want to have high ethical standards. We also want to build and maintain user trust, and therefore care for our reputation as a company. And lastly, we don't want to have any problems with the regulators.

Speaking of which, regulations are taking over the world, and it's becoming more challenging to find a country without data protection laws in place. On this picture, I highlighted only the main regulations, the ones I'm familiar with, but I'm sure there are a lot more. The central regulation is GDPR. I'm sure you've heard of it, and most other countries get inspired by it when producing their own data protection laws. Therefore, by making sure you are GDPR compliant, you automatically become compliant with most of the other regulations. Not all of them: some countries are a little bit more special, for example the US with its Federal Trade Commission. But still, GDPR is a good reference point for compliance.

Now, when I say privacy, most people usually hear Signal, Secret Chats, the onion network, blockchain, etc.
And even my friends who work in tech but outside the privacy domain think I work on end-to-end encryption, making sure people have private means of communication. And it's funny, because I wish they knew how different my work actually is. It's a lot less cool. But the point I wanted to make here is that privacy is so much more than security. Indeed, security, encryption, and end-to-end encryption are an aspect of privacy, but privacy is a lot more. So far, it's probably been the most ambiguous and multifaceted domain I've ever worked in, due to its substantial regulatory compliance side.

Now, I want to give an overview of the main principles to give you an idea of what the scope is. I tried to make it as short as possible so you guys don't get bored, but bear with me here, this is important.

The privacy principles. The user personal data that you collect has to be protected from unauthorized access, theft, loss, or damage, right?

You also, as a company or organization, want to be transparent with the user about what data you collect, what purpose you collect it for, who you perhaps share it with, and how long you're going to retain it for. Those privacy notices that you often see are about the transparency principle. Lots of people find them annoying, but this is something that is required by law.

Now, the user should have a means to access their personal data, perhaps modify it, opt out of some parts, and delete it altogether, which is the right to be forgotten. The user has to have a simple means of deleting all of their personal data from your system.

Now, this is not all. I have a couple more for you guys. Data minimization is about collecting only the data necessary for your functionality to work.

You can't repurpose the data. So let's say you collected user emails for registration purposes. You can't later use them for analytics, for example, or for marketing, without the user's consent.
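The email example can be turned into a small check. This is only a sketch under assumed names (`PersonalDataRecord`, `use`, and the purpose strings are hypothetical, not from the talk): each piece of data remembers the purposes it was collected for, and any other use is refused unless the user has consented to it.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalDataRecord:
    """A piece of personal data plus the purposes it may be used for."""
    value: str
    collected_for: set[str]  # purposes declared at collection time
    consented_to: set[str] = field(default_factory=set)  # purposes agreed to later

    def allowed_for(self, purpose: str) -> bool:
        # Use is allowed for the original purpose or any consented purpose.
        return purpose in self.collected_for or purpose in self.consented_to


def use(record: PersonalDataRecord, purpose: str) -> str:
    """Return the value only if this purpose is permitted; otherwise refuse."""
    if not record.allowed_for(purpose):
        raise PermissionError(f"data may not be used for {purpose!r} without consent")
    return record.value


# Hypothetical email collected for registration only.
email = PersonalDataRecord("user@example.com", collected_for={"registration"})
```

With this sketch, `use(email, "registration")` returns the address, `use(email, "marketing")` raises `PermissionError`, and the marketing use only starts working once `"marketing"` is added to `email.consented_to`.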
Yeah, purpose limitation is about not repurposing the data.

Now, there has to be a defined period of time you are supposed to retain the data for. GDPR, for example, doesn't prescribe a specific retention period. You have to come up with it yourself, as long as it's defined, and you shouldn't store data longer than that.

And finally, accountability: as an organization, you are held responsible for adhering to the privacy regulations, and you must be able to demonstrate compliance upon request.

So these are the main principles, but not all of them. I haven't touched on some aspects like data sharing, and I'm sure there are a few details I might have missed here. However, that's the general idea, and I hope the scope is more or less clear.

So now we can talk about infrastructure. We want to build infrastructure with a privacy layer. How do we do that? It's helpful to think about it by visualizing three main components. We have our regulations and commitments, which are often translated into policies. We have our data. And we have infrastructure and product code. They interact with one another in such a way that the data gets associated with a set of policies, and your code is aware of what kind of data it processes: it recognizes what data is sensitive, and it is aware of the policies and is able to enforce them. Hopefully that makes sense. I'll pause here for a moment to make sure it's digested, and then we can talk about each of these components in more detail.

So, regulations and commitments. We know what a regulation is, right? It's a rule defined by regulators. Whereas a commitment is something a company chooses to come up with itself. For example, WhatsApp positions itself as an end-to-end encrypted messenger. End-to-end encryption is not required by law, but it's a commitment, and the law requires WhatsApp to stick to it.
So we have our regulations, we might have some additional commitments, and they are translated into policies. A policy consists of three things. It has a scope, meaning the geographical area where it applies, it has its enforcement logic, and it has a way to get associated with the data.

Just a couple of examples here to make sure it's clear what a policy is. For example: email shouldn't be used for marketing purposes without consent in the EU. It has its geographical tag, it has its data element, and it has its logic. A retention policy is something I took from the US: bank statements can't be retained for longer than seven years. It's also a good example of a policy.

Once we have our policies well defined and established, we can talk about data understanding. Data understanding is crucial. Usually it consists of approximately the following steps. The first would be categorizing your data, defining which of the data you process is sensitive, and registering it in one way or another. Usually it gets registered centrally as a set of annotations or a taxonomy. When you have that, you can annotate your data throughout your infrastructure, usually at the API level: your data transfer objects, database schemas, and any data sink that you have can be annotated. This is quite helpful for tracking where the data is stored and where it travels through.

An advanced step would be building something called data lineage. Data lineage is a way to track the life cycle of the data across the infrastructure. Ideally, we want to pick a data element, let's say an email in this case, and see a graph showing how it travels through our platform: where it comes in, what elements it goes through, and where it ends up. This is a powerful tool that helps you gain a lot of control over your data, and you can also build a lot of things on top of it. In particular, you can build continuous compliance and automated breach detection.
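To make the lineage idea concrete, here is a minimal sketch, not a real lineage tool. The graph, the node names (`signup_api`, `analytics_warehouse`, and so on), and the helper functions are all hypothetical. It models lineage as a directed graph and uses a breadth-first search to find every place a data element can end up, which is enough to flag a forbidden destination.

```python
from collections import deque

# Hypothetical lineage edges: each node lists the places it sends data to.
LINEAGE = {
    "signup_api": ["user_service"],
    "user_service": ["users_db", "event_stream"],
    "event_stream": ["analytics_warehouse"],
    "users_db": [],
    "analytics_warehouse": [],
}


def reachable_from(graph: dict[str, list[str]], source: str) -> set[str]:
    """Breadth-first search: every node that data entering at `source` can reach."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def violations(graph: dict[str, list[str]], source: str, forbidden_sinks: list[str]) -> set[str]:
    """Forbidden sinks that the data element can actually reach."""
    return reachable_from(graph, source) & set(forbidden_sinks)
```

If an email entering at `signup_api` is not supposed to reach analytics, `violations(LINEAGE, "signup_api", ["analytics_warehouse"])` returns that sink, and a scheduled job could run the same check on every lineage update.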
For example, you can continuously verify that some parts of the data do not end up where they're not supposed to end up, or are not retained for longer than they're supposed to be. You can also generate evidence for regulatory compliance. There are some solutions out there that you can use for data lineage, but you can also build it yourself. Something like production sampling, dynamic analysis or instrumentation, data probing, and even static analysis can be used, at least to complement your data lineage graph.

Okay, so a little bit on the infrastructure and code. It integrates our policies in such a way that it knows how to associate sensitive data with a set of policies. Often, propagating policy context with your data is really helpful if you have context propagation in place: if your policy or set of policies can travel through the invocation tree of your code together with the data, that would be really helpful. And finally, when the data reaches its destination point, or some other points, the policy can be enforced by the code.

All right, so hopefully that was more or less clear. Now I want to talk a little bit about the practices out there that help us prevent privacy incidents and protect user data. Some of the practices are relatively straightforward, and you probably already have them in place. Something like access control ensures that our data is protected from unauthorized access. Encryption makes sure that our data stays protected even in case of unauthorized access.

Some additional practices, like anonymization. Sorry, I always have trouble pronouncing it. With anonymization, you can modify your data before sharing it, either with a third-party company or with, let's say, your analytics layer, which lots of other people throughout the company have access to. You can anonymize it: delete the personally identifiable information. Let's say, delete user IDs or replace them with replacement IDs.
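The ID replacement just described could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Pseudonymizer` class, the `user_ref` field name, and the event shape are hypothetical. Strictly speaking, replacing IDs while keeping a mapping is usually called pseudonymization, and the mapping, held in memory here, would need to live in a separately protected store.

```python
import secrets


class Pseudonymizer:
    """Swaps real user IDs for random replacement IDs.

    The mapping lives only inside this object; in practice it would be kept
    in a separate, more tightly controlled store (an assumption of this sketch).
    """

    def __init__(self) -> None:
        self._real_to_replacement: dict[str, str] = {}

    def replacement_for(self, user_id: str) -> str:
        # Reuse the same replacement ID so records for one user stay linkable.
        if user_id not in self._real_to_replacement:
            self._real_to_replacement[user_id] = secrets.token_hex(8)
        return self._real_to_replacement[user_id]


def pseudonymize(events: list[dict], pseudonymizer: Pseudonymizer) -> list[dict]:
    """Return copies of the events with `user_id` replaced and `email` dropped."""
    cleaned = []
    for event in events:
        safe = {k: v for k, v in event.items() if k not in ("user_id", "email")}
        safe["user_ref"] = pseudonymizer.replacement_for(event["user_id"])
        cleaned.append(safe)
    return cleaned
```

The cleaned events keep their analytical fields, and events from the same user still share one `user_ref`, but nobody without access to the mapping can get back to the real user ID.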
You can store the mappings between ID and replacement ID somewhere separate, on a separate security layer, so the data can't be traced back to your user.

Now, data coarsening is about reducing the granularity of your data. Let's say you have the geographical location of a user, and perhaps you don't need the precise location for analytics. Maybe you can make it approximate and then analyze it. That could also help to prevent some privacy incidents.

Now, some of the organizational measures that you can have in place for privacy. As I mentioned before, having well-defined privacy policies in your company is really important. Have them documented and very well defined. Once you have that, you can train your employees to make sure they are aware of the privacy policies, the privacy regulations, and the principles, so they can build your product keeping that in mind. Then there are processes like privacy reviews: let's say you release a feature and you want to double-check that it's privacy compliant; it can be verified against the privacy policies by another team. Regular risk assessments and compliance audits also help to identify vulnerabilities in your system and reduce the risk of a privacy incident.

I also want to say a couple of words on continuous compliance. It's one thing to be compliant when, say, you launch a product or release a feature, but staying compliant is a different thing. Especially as the company grows bigger, things tend to get out of control, and having continuous compliance becomes more and more useful. Automated breach detection is about having a set of processes in place, let's say some regular verification jobs, that make sure the data is still compliant with your set of policies. And if a breach is found, you can raise an incident. It can be something as simple as an email, or something a little bit more complicated, like, I don't know, setting off a fire alarm or filing a task. It can be anything, as long as it's addressed.

Yeah, so hopefully I still have your attention, guys. There are just a couple of slides left. Now I want to talk about privacy investments that you can make today in order to avoid a headache later. I decided to break it down by company size. There is a scale here that goes from a startup to big tech, and depending on where you are on the scale, you might choose to do different things.

So let's say you're a startup, right? Early stages of development, things are uncertain, no time to think about privacy. Privacy is probably the last thing you want to think about, and I totally understand that. So here are a couple of foundational things you can do that are not necessarily privacy related, but can help reduce the privacy headache. Context propagation: I highly recommend it. It is really useful not only for privacy but for a number of other things, particularly logging and debugging, and also for having a sense of control over what's going on in your system. On top of that, you can do privacy propagation too. Distributed tracing is about making sure your context is propagated from service to service and beyond. Having distributed tracing makes sure you have context propagation across the entire system. Also, at this point you can implement something basic like consent. It wouldn't take too much, so why not do that? And having basic security measures, like encryption, access control, and perhaps something else, would not hurt either.

So let's say you're a growing company now. You've got more users, more data, more employees coming in and out, and things are getting serious at this point, and inevitably more hectic. So this is a good time to have your privacy policies already established.
And again, once you have that, you can train your employees to make sure they are aware of what privacy is and how to make sure the product they build is compliant. Understanding and categorizing your data is really helpful here too. At this point, you can annotate your data. On top of the context propagation that we hopefully built previously, we can do policy propagation and already start implementing some policy enforcement logic for when the data and its set of policies reach a certain point.

Now, you are a bigger company. In fact, you're huge: you've got a tremendous amount of users, more data, more employees, and things might even be getting out of control at this point, because it's becoming more difficult to keep track of things. At this point, you're also an easier target for the regulators. They are more likely to knock on your door one day, because you have more impact on users and on society in general. So data lineage is probably a good idea. It would help you to, sorry, this is my cat making noise. It would help you to keep track of things. Continuous compliance, automated breach detection, and automatic evidence generation would probably also be helpful here. And of course, regular risk assessments and compliance audits would also help to reduce the risk of privacy incidents to a significant extent.

So this is the end of my slides. The takeaway for you guys: privacy is something that benefits from investing early, pretty much like everything else, but privacy especially. Understanding and categorizing your data would help keep things structured, and having well-defined, established policies would also provide a significant advantage.

Thank you very much. I hope that was clear and somewhat useful, and I hope to see you next time. Bye.
Natalia Grybovska

Software Engineer @ Meta
