Solving GDPR Challenges with Okera (Part 1)

In this two-part series, we will focus on the GDPR challenges facing organizations and how Okera can help solve them. This first post covers Consent and the Right to be Forgotten. The second post covers Pseudonymization.

GDPR Background

In May of 2018, the new General Data Protection Regulation (GDPR) of the European Union (EU) will go into effect. With the growing number of data leaks and breaches across the globe, the EU decided it was time to update its privacy laws.

You may be reading this from somewhere outside of the EU and wonder what effect this will have on you. If you provide an online service with customers in one of the EU countries, then you should be concerned. Failing to comply with the GDPR rules may result in steep penalties, ranging into the millions of euros.

A quick internet search will yield countless articles and blog posts that define and discuss the GDPR and its key requirements, so we will refrain from repeating them here. We will instead focus on two of the main technical hurdles an international enterprise will have to overcome to be compliant:

  • Consent and Right to be forgotten
  • Pseudonymization

Generally speaking, GDPR applies to data controllers: any organization that collects and/or processes data from EU residents and is ultimately responsible for the safekeeping of that data. The regulation requires that you report any issues with that data (such as breaches) immediately, and that you architect your entire IT infrastructure and application development with a focus on privacy by design. The advantage of designing an entire infrastructure around privacy by design is that every layer in your hardware and software stack will be built with current and future security best practices in mind.

Conversely, with the advent of Big Data and its proliferation within the last decade, you are most likely to see an abundance of storage technologies and software systems in use in large enterprises. Since “betting on one horse” is, in practice, not in the interest of typically risk-averse enterprises, it is common for various teams within the enterprise to use a variety of technologies to solve the same problem: collecting every bit of data that is available in an attempt to fulfill the promise of the unreasonable effectiveness of data. Moore’s Law got us to a stage where we, as individuals (think smart devices and always-connected computers), could produce so many more tangible data points that we had to build an equally limitless pool of storage to cope with the onslaught of that data. Thanks to the same trend, building that storage was no longer a technical challenge.

Now there were cloud-based or on-premises Hadoop clusters and distributed file and object storage systems, which consumed any signal that was available from both customers and users (for instance, clickstream or sales data) and commercial sources (like weather data or social media graphs). These systems continue to collect today and will likely do so for the foreseeable future. The challenge now is: How do you make this data accessible to your analysts or data scientists?

On top of that challenge, the GDPR states that a subject (a person or entity whose data is stored by an organization) has the right to know what is stored and the choice of being completely removed from any usage of their data (opt out). A subject must also provide explicit consent to the use of their data. With so many storage solutions in place, the management of these GDPR rules is an organizational nightmare and can be highly error-prone in practice. The lowest common denominator is often to lock up most, if not all, collected data. Doing so renders the data useless for analytics and insights, even for use cases that are inherently compliant with GDPR guidelines.

Consent and Right to be forgotten

In an enterprise environment, Personally Identifiable Information (PII) can reside in an HDFS cluster, an Amazon S3 bucket, or in a relational database, or in a combination of all three at any given time. There may be many files or tables that contain various PII on the same individual, strewn across terabytes of data. Combine that with the GDPR requirement of the right to be forgotten, and you are faced with the expensive process of purging the data whenever a user opts out or when consent was not obtained in the first place.

In practice, removing PII is commonly an Extract, Transform, and Load (ETL) job that runs every few weeks, or once per quarter. This practice is not compliant with the GDPR, which requires these actions to take effect without undue delay (Recital 59 of the GDPR states “…at the latest within one month…”). In addition, the data remains active until it is removed, which translates to additional processing to purge, obfuscate, or hide PII records that should no longer be accessible. This functionality is commonly achieved by applying the following filters against each dataset to produce a GDPR-compliant dataset free of offending data:

  • Whitelists – A list of all record IDs of subjects that have given consent to the use of their data.
  • Blacklists – A list of record IDs of subjects that have opted out of the use of their data.

These lists are used to filter out, or allow inclusion of, any matching record. While this approach is not uncommon for data filtration, it is a non-trivial challenge to solve given the many storage locations and sheer volume of data where PII data may reside in most enterprise environments.
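The whitelist and blacklist filters described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how any particular product implements it; the function name filter_records, the userid key, and the sample data are all assumptions made for the example.

```python
def filter_records(records, whitelist=None, blacklist=None, key="userid"):
    """Return the records allowed by the given lists.

    whitelist: IDs of subjects who gave consent (None means "no whitelist").
    blacklist: IDs of subjects who opted out (None means "no blacklist").
    """
    allowed = set(whitelist) if whitelist is not None else None
    denied = set(blacklist) if blacklist is not None else set()
    result = []
    for record in records:
        record_id = record[key]
        # With a whitelist in place, anything not listed is dropped.
        if allowed is not None and record_id not in allowed:
            continue
        # An opt-out always wins, even if consent was recorded earlier.
        if record_id in denied:
            continue
        result.append(record)
    return result


# Hypothetical sample data.
transactions = [
    {"txnid": 1, "userid": 10, "sku": "A-1"},
    {"txnid": 2, "userid": 11, "sku": "B-7"},
    {"txnid": 3, "userid": 12, "sku": "C-3"},
]
consented = {10, 11}
opted_out = {11}

compliant = filter_records(transactions, whitelist=consented, blacklist=opted_out)
print([r["txnid"] for r in compliant])
```

In this sketch the blacklist takes precedence over the whitelist, which is a design choice: a later opt-out should override an earlier consent.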

The following diagram shows how Cerebro’s Data Access Service (CDAS) solves this problem by filtering data on-the-fly:

[Diagram: CDAS]

Instead of rewriting data every time a person gives consent or opts out, the data is filtered on access (see bullet #1 in the diagram). A choice of white- and/or blacklists is used to define which records should be included in the result set. Also shown here is how CDAS converts every data source into a table structure, regardless of the original format (bullet #2). Every supported format is read and parsed, and the catalog-defined schema is applied, allowing higher-level tools to treat each source like a database table. CDAS also enables users to grant access to databases and datasets in a fine-grained manner, including the ability to pass on the grant privilege to other users, effectively building a hierarchy of administrators over the entire data catalog (bullet #3).

The practice of applying filters on queries is not new to DBAs, who commonly employ database VIEWs, where the VIEW is a SELECT statement that JOINs the main and filter tables and, using the default INNER JOIN behavior, removes all records that have no match on the other side. This is not possible for data sources that do not expose a tabular layout. CDAS solves this problem by abstracting every data source as a table, thereby enabling database manipulation functionality on every dataset.
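As a minimal sketch of the VIEW-based filter described above, the following Python snippet uses the standard-library sqlite3 module as a stand-in for a real data platform. The table and view names (transactions, consent_whitelist, transactions_consented) and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (txnid INTEGER, userid INTEGER, sku TEXT);
CREATE TABLE consent_whitelist (userid INTEGER PRIMARY KEY);

INSERT INTO transactions VALUES (1, 10, 'A-1'), (2, 11, 'B-7'), (3, 12, 'C-3');
INSERT INTO consent_whitelist VALUES (10), (12);

-- The INNER JOIN drops every transaction whose user is not on the whitelist.
CREATE VIEW transactions_consented AS
  SELECT t.txnid, t.userid, t.sku
  FROM transactions t
  JOIN consent_whitelist w ON t.userid = w.userid;
""")

query = "SELECT txnid, userid FROM transactions_consented ORDER BY txnid"
rows_before = conn.execute(query).fetchall()

# Withdrawing consent only touches the small filter table;
# the transaction data itself is never rewritten.
conn.execute("DELETE FROM consent_whitelist WHERE userid = 12")
rows_after = conn.execute(query).fetchall()

print(rows_before)
print(rows_after)
```

Because the VIEW is evaluated at query time, the change to the filter table is visible on the very next read, without any bulk rewrite of the main table.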

Another complication with legacy and Big Data systems is that the main and filter tables can be very large, making the JOIN either very expensive or technically infeasible, given resource constraints. CDAS optimizes cluster resources to cache filter tables, thus making large JOINs not only possible, but also fast for recurring operations (bullet #4). Using CDAS as the active middle layer for centralized authentication, authorization, and audit logging meets the GDPR requirement of enforcing the right to be forgotten and the explicit consent for inclusion, in a reasonable amount of time: instead of doing bulk deletes every few weeks or months, you can apply changes on the fly as data is being accessed.
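The idea of keeping a filter table cached so that recurring reads do not reload it can be illustrated with a small Python sketch. It is a simplified, single-process analogy and says nothing about how CDAS caches data internally; the CachedFilter class, the read_filtered function, and the sample data are hypothetical.

```python
class CachedFilter:
    """Loads a set of record IDs once and reuses it until invalidated."""

    def __init__(self, load_ids):
        self._load_ids = load_ids  # callable returning an iterable of IDs
        self._ids = None
        self.loads = 0             # how often the source was actually read

    def ids(self):
        if self._ids is None:
            self._ids = frozenset(self._load_ids())
            self.loads += 1
        return self._ids

    def invalidate(self):
        """Forget the cached IDs, e.g. after a new opt-out was recorded."""
        self._ids = None


def read_filtered(records, opt_outs, key="userid"):
    """Yield records on access, skipping subjects on the cached blacklist."""
    denied = opt_outs.ids()
    for record in records:
        if record[key] not in denied:
            yield record


# Hypothetical sample data.
transactions = [
    {"txnid": 1, "userid": 10},
    {"txnid": 2, "userid": 11},
    {"txnid": 3, "userid": 12},
]
opt_out_store = {11}
cache = CachedFilter(lambda: opt_out_store)

first = [r["txnid"] for r in read_filtered(transactions, cache)]
second = [r["txnid"] for r in read_filtered(transactions, cache)]
loads_after_two_reads = cache.loads

# A new opt-out arrives: update the store and drop the cached copy.
opt_out_store.add(12)
cache.invalidate()
third = [r["txnid"] for r in read_filtered(transactions, cache)]

print(first, second, third, cache.loads)
```

The two initial reads share one load of the filter set, and the new opt-out takes effect on the next read after invalidation rather than at the next bulk delete.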

As an example, assume that you have a database called demo that has datasets for all user transactions and activity (located in an S3 bucket). You can use the following SQL commands to implement a whitelist of all users who were active within the last 18 months:

1. First, we create datasets that point to transaction and activity data:

CREATE EXTERNAL TABLE demo.transactions(
  txnid BIGINT, dt_time STRING, sku STRING, userid INT,
  price FLOAT, creditcard STRING, ip STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "s3://cerebro-datasets/transactions";

CREATE EXTERNAL TABLE demo.user_activity(dt_time STRING, userid INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "s3://cerebro-datasets/user_activity";

2. Next, we create a VIEW that filters out all inactive users:

CREATE VIEW demo.active_users AS
SELECT * FROM demo.user_activity
WHERE unix_timestamp(dt_time, "M/dd/yy H:mm") > unix_timestamp(months_sub(now(), 18));

3. Now, we JOIN the two datasets to filter the transactions by user activity and return only those where the respective user was active.
Note the GRANT statement to only allow members of the analyst role access to the VIEW:

CREATE VIEW demo.transactions_active_users AS
SELECT t.txnid, u.dt_time AS last_active_time, t.dt_time AS transaction_time,
       t.sku, t.userid, t.price, t.creditcard, t.ip
FROM demo.transactions t
JOIN demo.active_users u ON t.userid = u.userid;

GRANT SELECT ON TABLE demo.transactions_active_users TO ROLE analyst_role;

4. Lastly, we define a VIEW that JOINs the transaction and user activity datasets and anonymizes the transactions of inactive users. Here we use an OUTER JOIN to also include all transactions whose user has no matching entry in the active_users VIEW. For those rows, the VIEW redacts all sensitive columns to make the result easier to share:

CREATE VIEW demo.transactions_anonymize_inactive_users AS
SELECT t.txnid,
       decode(u.userid, NULL, "INACTIVE", u.dt_time) AS last_active_time,
       t.dt_time AS transaction_time,
       t.sku,
       decode(u.userid, NULL, "REDACTED", cast(t.userid AS STRING)) AS userid,
       t.price,
       decode(u.userid, NULL, "REDACTED", t.creditcard) AS ccn,
       decode(u.userid, NULL, "REDACTED", t.ip) AS ip
FROM demo.transactions t
LEFT OUTER JOIN demo.active_users u ON t.userid = u.userid;

GRANT SELECT ON TABLE demo.transactions_anonymize_inactive_users TO ROLE analyst_role;
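To see the effect of the OUTER JOIN with redaction on a small scale, here is a hedged sketch using Python's standard-library sqlite3 module. SQLite has no decode() function, so the sketch substitutes an equivalent CASE expression; it also skips the S3 tables, the date filter, and the GRANT. The table names, view name, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (txnid INTEGER, userid INTEGER, creditcard TEXT, ip TEXT);
CREATE TABLE active_users (userid INTEGER, dt_time TEXT);

INSERT INTO transactions VALUES
  (1, 10, '4111-1111-1111-1111', '10.0.0.1'),
  (2, 11, '5500-0000-0000-0004', '10.0.0.2');
-- Only user 10 counts as active in this toy data set.
INSERT INTO active_users VALUES (10, '3/05/18 9:41');

-- LEFT OUTER JOIN keeps every transaction; rows without a matching
-- active user get NULLs on the right side, which CASE turns into markers.
CREATE VIEW transactions_anonymized AS
  SELECT t.txnid,
         CASE WHEN u.userid IS NULL THEN 'INACTIVE' ELSE u.dt_time END AS last_active_time,
         CASE WHEN u.userid IS NULL THEN 'REDACTED' ELSE CAST(t.userid AS TEXT) END AS userid,
         CASE WHEN u.userid IS NULL THEN 'REDACTED' ELSE t.creditcard END AS ccn,
         CASE WHEN u.userid IS NULL THEN 'REDACTED' ELSE t.ip END AS ip
  FROM transactions t
  LEFT OUTER JOIN active_users u ON t.userid = u.userid;
""")

rows = conn.execute(
    "SELECT * FROM transactions_anonymized ORDER BY txnid"
).fetchall()
for row in rows:
    print(row)
```

The transaction of the active user passes through unchanged, while the transaction of the inactive user keeps its txnid but has every identifying column replaced.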

CDAS has many more features that can be used in combination; for example, to mask portions of credit card numbers. Refer to the official documentation for details. In summary, you can create any VIEW over any data source and provide access to each through role-based access control using GRANTs as shown.
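To make the idea of masking portions of a credit card number concrete, here is a small Python sketch. It does not reproduce any CDAS function; mask_credit_card, its parameters, and the sample number are assumptions made purely for illustration.

```python
def mask_credit_card(number, visible=4, mask_char="X"):
    """Replace all but the last `visible` digits with `mask_char`.

    Separators such as dashes or spaces are left untouched.
    """
    total_digits = sum(ch.isdigit() for ch in number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


print(mask_credit_card("4111-1111-1111-1234"))
```

Keeping the last four digits visible is a common convention because it lets a subject or support agent recognize a card without exposing the full number.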

Coming Next in Part 2

In Part 2 of this series we will discuss Pseudonymization.

We would love to give you more information on how Okera can solve your data access needs, including the looming GDPR requirements. Visit us at www.okera.com to learn more.