
Solving GDPR Challenges with Okera (Part 2)

In this two-part series we focus on the GDPR challenges facing organizations and how Okera can help solve them. The first part covered the Consent and Right to be forgotten aspects of GDPR. In this second post we cover Pseudonymisation.


Another requirement defined by GDPR is the pseudonymization of PII data. This includes tokenizing data so that it can no longer be used to identify subjects without access to additional data. The GDPR recognizes that pseudonymization is not without limitations and therefore still considers such data to be personal.
That being said, pseudonymization is another step in responsible data usage: why give someone access to customer data that they do not require? It is much more sensible to hide unnecessary data and have users request access where needed. In CDAS, the latter is handled by the UI, which shows the schema of a dataset with all of its columns and clearly states which are accessible to that user and which are not. By clicking on the “See Groups” link, the user can determine which group they need to belong to in order to gain access to the content. A company-internal process can then handle the steps to add the user to the necessary group(s).
A typical approach to handling pseudonymization in a heterogeneous Big Data architecture is to run ETL jobs that save multiple copies of the same datasets while applying the needed transformations. A common example is an organization that has a master dataset with PII and would like to share it with a team in the company. To provide that team with a subset of the dataset without the PII, the organization creates a copy of the master dataset with the relevant information or specific data redacted. This approach not only generates unnecessary I/O traffic as data is replicated, but also leaves a second, nearly identical dataset to maintain. The result is more resources consumed, as well as an additional layer of administration to ensure that the copied data stays valid over time.
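The maintenance cost of the copy-based approach can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the in-memory `master` list stands in for a master dataset, the second row is made up for the illustration, and `redact` and `read_redacted` are hypothetical helper names, not part of any product API.

```python
# Columns treated as PII in this toy example (an assumption for illustration).
PII_COLUMNS = {"userid", "creditcard"}


def redact(row):
    """Return a copy of the row without the PII columns."""
    return {k: v for k, v in row.items() if k not in PII_COLUMNS}


# Stand-in for the master dataset.
master = [
    {"txnid": 17404063, "sku": "sku26", "price": 720,
     "userid": 50001, "creditcard": "1233-0596-0058-7669"},
]

# ETL approach: materialize a redacted copy once.
etl_copy = [redact(row) for row in master]


def read_redacted():
    """Read-time approach: redact whenever the data is read."""
    return [redact(row) for row in master]


# A new (made-up) row arrives in the master dataset after the ETL job ran.
master.append({"txnid": 17404064, "sku": "sku27", "price": 15,
               "userid": 50002, "creditcard": "0000-0000-0000-0000"})

print(len(etl_copy))         # the copy is stale: still one row
print(len(read_redacted()))  # the read-time result reflects both rows
```

The copy has to be rebuilt by another ETL run before it is valid again, while the read-time variant never stores a second dataset.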

Okera solves this dilemma by applying ad-hoc transformations, based on the current user’s roles, when reading the uncurated dataset. Consider a transactions table, demo.transactions, that operates directly on the underlying files. In an existing system, users would have access to all of the data or none of it. With Okera, you can create a view that applies on-the-fly transformations to provide only the information needed at that time, as shown in the following:

CREATE VIEW demo.transactions_safe AS
SELECT txnid, dt_time, sku,
  decode(effective_user(), "admin", userid, tokenize(userid)) AS userid,
  price,
  decode(effective_user(), "admin", creditcard, mask_ccn(creditcard)) AS creditcard,
  decode(effective_user(), "admin", ip, cast(tokenize(ip) as STRING)) AS ip
FROM demo.transactions;

GRANT SELECT ON TABLE demo.transactions TO ROLE analyst_role;

This VIEW operates on the original table, but applies special functions to the userid, creditcard and ip columns. Based on the effective user (that is, the user that issued the query), the content is either masked or returned as-is. If you have been granted full access to the table, you can read the VIEW like the original table and see all the data unobfuscated:
admin> select * from demo.transactions_safe where txnid = 17404063
"txnid": 17404063,
"dt_time": "04/29/2016 09:32 PM",
"sku": "sku26",
"userid": 50001,
"price": 720,
"creditcard": "1233-0596-0058-7669",
"ip": "229.825.494.945"
If you are not an administrator, the appropriate masking will be applied to your results:
analyst> select * from demo.transactions_safe where txnid = 17404063
"txnid": 17404063,
"dt_time": "04/29/2016 09:32 PM",
"sku": "sku26",
"userid": -9141668874968000619,
"price": 720,
"creditcard": "XXXX-XXXX-XXXX-7669",
"ip": "8661267957790161894"

Note how the userid and ip values are tokenized; the creditcard number is partially redacted. CDAS supports a number of built-in functions, and you can write your own user-defined functions (UDFs) as you see fit.
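
To make the difference between tokenizing and partially redacting concrete, here is a minimal Python sketch of the two kinds of transformation. It is not Okera's implementation: the function names are borrowed from the view only for readability, and the HMAC-based token, the `SECRET_KEY` value, and the `read_row` helper are assumptions made for this illustration.

```python
import hashlib
import hmac

# Hypothetical key for the demo; a real deployment would not hard-code it.
SECRET_KEY = b"demo-only-secret"


def tokenize(value, key=SECRET_KEY):
    """Deterministically map a value to a signed 64-bit integer token."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def mask_ccn(ccn):
    """Replace every digit except the last four with 'X', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in ccn)
    seen = 0
    out = []
    for ch in ccn:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - 4 else "X")
        else:
            out.append(ch)
    return "".join(out)


def read_row(row, user):
    """Return the row as-is for 'admin', otherwise with the PII columns masked."""
    if user == "admin":
        return dict(row)
    safe = dict(row)
    safe["userid"] = tokenize(row["userid"])
    safe["creditcard"] = mask_ccn(row["creditcard"])
    safe["ip"] = str(tokenize(row["ip"]))
    return safe


row = {"txnid": 17404063, "sku": "sku26", "userid": 50001, "price": 720,
       "creditcard": "1233-0596-0058-7669", "ip": "229.825.494.945"}

print(read_row(row, "admin")["creditcard"])    # 1233-0596-0058-7669
print(read_row(row, "analyst")["creditcard"])  # XXXX-XXXX-XXXX-7669
```

Because the token is deterministic, the same userid always maps to the same token, so an analyst can still group or join on the tokenized column without seeing the original value.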

Outlook and Summary

There is an additional advantage to using Okera as a single, unified access layer over all your data sources: it helps with the GDPR requirement called Right of access which, as you may have guessed, gives subjects the right to get access to their data. The unified catalog of Okera keeps all of the datasets and their details in one place, and using the single view over the data in multiple systems simplifies the extraction of all records for a given subject.
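
As a rough illustration of why a single catalog simplifies a Right of access request, the following Python sketch collects every record for one subject across several datasets. Everything in it is hypothetical: the `catalog` dictionary stands in for a unified catalog, `demo.support_tickets` and its rows are invented for the example, and `records_for_subject` is not an Okera API.

```python
# Hypothetical catalog: dataset name -> (column identifying the subject, rows).
catalog = {
    "demo.transactions": ("userid", [
        {"txnid": 17404063, "userid": 50001, "sku": "sku26", "price": 720},
        {"txnid": 17404064, "userid": 50002, "sku": "sku27", "price": 15},
    ]),
    "demo.support_tickets": ("customer_id", [
        {"ticket": 1, "customer_id": 50001, "subject": "refund"},
    ]),
}


def records_for_subject(catalog, subject_id):
    """Collect all rows belonging to one subject, grouped by dataset name."""
    result = {}
    for name, (key_column, rows) in catalog.items():
        matches = [row for row in rows if row.get(key_column) == subject_id]
        if matches:
            result[name] = matches
    return result


export = records_for_subject(catalog, 50001)
for dataset, rows in export.items():
    print(dataset, len(rows))
```

With one place that knows every dataset and which column identifies the subject, the request becomes a single loop rather than a separate lookup per system.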

Okera solves the technical GDPR requirements ad-hoc while providing federated access, and unified authorization and audit logging. This makes managing the vast amounts of data held in Big Data systems easier and streamlines the architecture by allowing users and applications to deal with a single point of access, instead of many different ones.
We would love to give you more information on how Okera can solve your data access needs, including the looming GDPR requirements. Visit us to learn more.