How to be GDPR compliant with ever-changing Big Data

The client’s big GDPR problem

It’s one thing to get to GDPR compliance, and it’s another thing to stay in compliance, especially when your data is big, unstructured, and changing by the minute. This was the problem faced by our client, a data science team at a Fortune 500 e-commerce giant.

The data science team used terabytes of data, in various formats, to train and build machine learning models. The team was quick and efficient at creating models, and data scientists had the autonomy to create datasets as needed, without oversight; the team stored and self-managed this data in Hadoop.

The only problem was that Personally Identifiable Information (PII) was lurking within all that self-managed, ever-changing data, and this kept the team from being in compliance with GDPR. Asking data scientists to manage their own PII by hand was infeasible, as was restricting team members from creating datasets however and whenever they needed.

Raybeam’s data compliance system

A tailored solution: Automated and flexible

At Raybeam, we always keep each client’s unique needs top of mind when designing a solution. When we designed a data governance system for this particular data science team, we had a few constraints:

  1. We did not want to interfere with the team’s workflow or create extra overhead or work for them.

  2. We wanted a solution flexible enough to handle data that was changing on a daily basis.

  3. We wanted to make sure our client not only got to compliance but stayed in compliance, and that the team would be able to pass their audits regularly.

How it works: Scan and remove

We devised a system with three phases that could be scheduled by users, that would require minimal intervention from data science team members, and that would keep the team in compliance even as they created new datasets, potentially containing PII, every day.

Raybeam's GDPR compliance system consists of three phases.
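
To make the shape of the system concrete, here is a minimal Python sketch of how the three phases might be wired into a single schedulable run. The phase functions are placeholders (each is fleshed out in the sections below), and the scheduling hook is an assumption; this illustrates the architecture, not the client’s actual implementation.

```python
# Minimal sketch of the three-phase pipeline (illustrative only).
# The phase functions below are placeholders for the real implementations
# described in the following sections.

from datetime import datetime, timezone

def scan_all_datasets():
    """Phase 1: crawl storage and compile a registry of every dataset."""
    ...

def flag_pii_datasets(registry):
    """Phase 2: identify which datasets contain PII and log their metadata."""
    ...

def remove_pii(flagged_datasets):
    """Phase 3: remove flagged PII and log the removal for audit purposes."""
    ...

def run_compliance_pipeline():
    """One schedulable run through all three phases."""
    registry = scan_all_datasets()
    flagged = flag_pii_datasets(registry)
    remove_pii(flagged)
    print(f"Compliance run finished at {datetime.now(timezone.utc):%Y-%m-%d %H:%M} UTC")

if __name__ == "__main__":
    # In practice this would be triggered on a user-defined schedule,
    # e.g. a daily cron entry or a workflow-scheduler job.
    run_compliance_pipeline()
```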

1. Scan all datasets

The first step in our process was simply to crawl through and compile a list of every single dataset (a table or any type of file) that existed in the data science team’s world.

The benefit of this compilation was that the team now had the starting point of a data registry. Knowing the names, sizes, and types of all the datasets they work with is generally useful, and when additional compliance laws or policies crop up, the team now has a central place to see what they are working with. Before this registry was compiled, the team did not actually know how much data, or what types of data (Parquet, Hive, JSON, CSV, etc.), they were working with as a group.
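
As an illustration, here is a minimal sketch of what such a scan might look like, assuming datasets live under a single storage root (for example, an HDFS mount or a shared data directory). The root path, the format list, and the registry file name are hypothetical.

```python
# A minimal sketch of the dataset-scanning phase, assuming datasets live
# under a single storage root. Paths, format inference, and the registry
# file name are illustrative assumptions.

import csv
import os

DATA_ROOT = "/data/science-team"  # hypothetical storage root
KNOWN_FORMATS = {".parquet", ".json", ".csv", ".avro", ".orc"}

def scan_all_datasets(root=DATA_ROOT):
    """Walk the storage root and compile a registry of every dataset."""
    registry = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            registry.append({
                "path": path,
                "bytes": os.path.getsize(path),
                "format": ext.lstrip(".") if ext in KNOWN_FORMATS else "unknown",
            })
    return registry

def write_registry(registry, out_path="dataset_registry.csv"):
    """Persist the registry so the team has a central data inventory."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "bytes", "format"])
        writer.writeheader()
        writer.writerows(registry)

if __name__ == "__main__":
    write_registry(scan_all_datasets())
```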

2. Flag all datasets that contain PII

The next step in the process was to comb through all those datasets, identify which of them contained PII, and save metadata about the flagged datasets, which must be logged and stored for audit purposes.
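
A minimal sketch of this flagging phase might look like the following. Real PII detection is considerably more involved; the email and phone-number regexes, the row-sampling limit, and the JSON-lines audit log are all illustrative assumptions.

```python
# A minimal sketch of the PII-flagging phase for CSV datasets. The regexes
# (emails and US-style phone numbers) are illustrative stand-ins for real
# PII detection, and the audit log format is an assumption.

import csv
import json
import re
from datetime import datetime, timezone

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def flag_pii_in_csv(path, sample_rows=1000):
    """Scan a CSV dataset and return the PII types found per column."""
    found = {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            for column, value in row.items():
                for pii_type, pattern in PII_PATTERNS.items():
                    if value and pattern.search(value):
                        found.setdefault(column, set()).add(pii_type)
    return {column: sorted(types) for column, types in found.items()}

def log_flagged_dataset(path, findings, log_path="pii_audit_log.jsonl"):
    """Append the flagged dataset's metadata to an audit log."""
    entry = {
        "dataset": path,
        "pii_columns": findings,
        "scanned_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
```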

3. Remove or redact PII from flagged datasets

The final stage of the process was the removal itself. This entailed taking the list of flagged datasets containing PII, removing from each dataset all users who needed to be removed according to GDPR guidelines, and logging the removal process for audit purposes.

As a note, there are techniques, such as obfuscation, that can keep you in compliance without completely removing PII, but we chose to simply remove the offending data because that worked best for this particular team’s use case and the way they used their data.
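
For illustration, here is a minimal sketch of the removal step for a CSV dataset, assuming a hypothetical user_id key column and a set of user IDs with pending erasure requests. It rewrites the file without the affected rows and appends a record to an audit log.

```python
# A minimal sketch of the removal phase for a CSV dataset: drop every row
# belonging to a user who must be erased under GDPR, then log the action.
# The key column name and the deletion-request source are assumptions.

import csv
import json
import os
from datetime import datetime, timezone

def remove_users(path, user_ids_to_erase, key_column="user_id"):
    """Rewrite the dataset without rows belonging to erased users."""
    tmp_path = path + ".tmp"
    removed = 0
    with open(path, newline="") as src, open(tmp_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get(key_column) in user_ids_to_erase:
                removed += 1
            else:
                writer.writerow(row)
    os.replace(tmp_path, path)  # atomic swap so readers never see partial data
    return removed

def log_removal(path, removed, log_path="removal_audit_log.jsonl"):
    """Record the removal so the team can show evidence during audits."""
    entry = {
        "dataset": path,
        "rows_removed": removed,
        "removed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
```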

To the finish line: GDPR audits and beyond

Putting this automated system in place, tailored to their needs, gave the team several advantages, all without forcing a change in workflow or adding onerous restrictions to the team’s way of working.

Firstly, in the short term, they were able to sail through an audit and legal review process that they had previously struggled to pass due to the dynamic nature of their data. They could now focus on their actual work of machine learning and model building while their data automatically stayed in compliance.

Secondly, this framework puts them in a good position to pass future audits. At many companies, audits occur regularly, and having this framework in place makes it easy to document and provide evidence of compliance for subsequent audits, rather than scrambling before every legal review.

Finally, the system also puts them in a strong position for any future laws, or even changes to current laws. The automated system serves as a foundation, and new rules or additional compliance requirements can easily be built on top of the current solution.

The future of data compliance

GDPR is not going away anytime soon. In fact, additional laws like the CCPA are being added to the list of regulations your company will soon need to follow. It’s best to have an established, flexible framework that can be tweaked to handle new laws; that way, you can continue working on your core business rather than scrambling to adjust to a new set of compliance rules each time one comes out.
