Organizations today are drowning in data – from web analytics, marketing campaigns, IoT events, HR data, and more. However, making effective use of this data for business purposes remains a largely unsolved challenge, in large part because access to it has to be provided in a secure and responsible way.
Traditionally, end users of data view any added layers of security or access control as slowing them down. I love the way Jason Chan at Netflix described this point of contention between the data platform team (central IT) and data consumers (analysts and data scientists) in his 2016 AWS re:Invent talk, The Psychology of Security Automation.
Data consumers see governance and information security requirements as blocking them from getting where they need to go. The teams involved are going to ask you some riddles to “cross the bridge” (get access to the data), and there’s no way to know the correct answer.
On the other hand, security, governance, and privacy teams see risk in data analysts and data scientists working with data. They expect these users to bypass all the rules and regulations (usually without malicious intent!) in order to get access to the data they need to do their jobs.
As a result of this disconnect, many organizations hold on to a false dichotomy of agility vs. governance when it comes to data access. It’s often posed as a question of either/or – you can have performance or security, security or easy access. But what if, instead of a trade-off, there was a bridge that connected both sets of stakeholders, one that had security built in from the ground up but was also designed for maximum agility?
Security as the table-stakes foundation for an agile analytics platform: this is our mission at Okera, and it is the problem we’re working to solve with our data access platform.
How to Think about Performance and Access Control
Customers ask us about performance all the time. Usually, they have two main concerns: fear of slowing down their end users, and a desire to avoid spending more money to get the same (or worse) performance than they already have.
There are three major areas where an access control solution could possibly add overhead (and by that we mean, in a way that reduces productivity):
- Authorizing a query
- Enforcing the access policy
- Being an organizational bottleneck
Let’s dive into each area and discuss how Okera’s platform handles them.
1. How expensive is it to authorize a query?
We’ve already talked a lot about why we believe attribute-based access control (ABAC) is key to scaling secure data access. Okera’s core architecture follows that of a traditional ABAC system, with a Policy Decision Point and a Policy Enforcement Engine (more on that in the next section).
Our Policy Decision Point is a metadata-only service. When a data access request comes to Okera, our platform does three things:
- Locates that dataset in the Metadata Registry.
- Dynamically looks up the relevant attributes and policies for that user on the requested data, and makes a decision on their access.
- Logs the request to access data.
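The three steps above can be sketched in a few lines of Python. This is a minimal illustration of a metadata-only ABAC decision, not Okera’s implementation or API: the `Policy` and `PolicyDecisionPoint` classes, the in-memory `registry` set, and the exact-match attribute comparison are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy: grants access to one dataset when attributes match."""
    dataset: str
    required_attributes: dict  # attribute name -> required value


@dataclass
class PolicyDecisionPoint:
    """Hypothetical decision point that only touches metadata, never the data."""
    registry: set   # known dataset names (stand-in for a metadata registry)
    policies: list  # list of Policy objects
    audit_log: list = field(default_factory=list)

    def authorize(self, user: str, user_attributes: dict, dataset: str) -> bool:
        # Step 1: locate the dataset in the registry.
        found = dataset in self.registry
        # Step 2: compare the user's attributes against the policies on that
        # dataset. With no matching policy the answer is "deny".
        allowed = found and any(
            p.dataset == dataset
            and all(user_attributes.get(k) == v
                    for k, v in p.required_attributes.items())
            for p in self.policies
        )
        # Step 3: log the request, whether it was allowed or denied.
        self.audit_log.append(
            {"user": user, "dataset": dataset, "allowed": allowed}
        )
        return allowed


pdp = PolicyDecisionPoint(
    registry={"sales.orders"},
    policies=[Policy("sales.orders", {"department": "finance"})],
)
print(pdp.authorize("ana", {"department": "finance"}, "sales.orders"))
print(pdp.authorize("bob", {"department": "hr"}, "sales.orders"))
```

Because the decision only reads metadata (the registry entry, the user’s attributes, and the policies), its cost does not depend on how much data the query will eventually scan.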
Okera’s platform was designed to be API-first, and we’ve ensured that these metadata calls are as performant as they can be. For example, some of our customers leverage user attributes from Active Directory as part of their Okera ABAC policies, and these attributes are dynamically fetched and compared against policies in the policy decision step whenever a user issues a query. On average, in both our internal testing and our customers’ production deployments, this policy authorization step takes mere milliseconds.
Just like the rest of our platform, our Policy Decision Point has been designed to handle maximum throughput and to be deployed in complex environments such as multi-tenant and hybrid setups, drastically limiting the likelihood of this step becoming a bottleneck.
2. How expensive is it to enforce the policy?
That brings us to the second place that security could potentially add overhead: actually enforcing the policy (e.g. masking, tokenization, row filtering, k-anonymization). This is the area about which we hear the most concern.
In an ABAC system, once the request has been authorized at the Policy Decision Point, it’s passed on to some kind of enforcement engine to implement the fine-grained access. What’s unique about Okera’s platform is that our enforcement engine is completely dynamic and enforces the policies on read, which means that no copies of the data are ever created and the organization has the maximum governance agility.
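To make “enforces the policies on read” concrete, here is a minimal Python sketch of a row filter and a column mask applied lazily as rows are read. It is an illustration under stated assumptions, not Okera’s enforcement engine: `mask_email`, `enforce_on_read`, and the dictionary-per-row representation are hypothetical names and choices made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator


def mask_email(value: str) -> str:
    """Hypothetical masking rule: keep the first character and the domain."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def enforce_on_read(
    rows: Iterable[dict],
    row_filter: Callable[[dict], bool],
    column_transforms: Dict[str, Callable],
) -> Iterator[dict]:
    """Yield policy-compliant rows one at a time.

    The source rows are never modified and no protected copy is stored;
    each transformed row exists only while the reader consumes it.
    """
    for row in rows:
        if not row_filter(row):
            continue  # row filtering: the user never sees this row
        yield {
            col: column_transforms.get(col, lambda v: v)(val)
            for col, val in row.items()
        }


source = [
    {"region": "EU", "email": "ana@example.com"},
    {"region": "US", "email": "bob@example.com"},
]
visible = enforce_on_read(
    source,
    row_filter=lambda r: r["region"] == "EU",
    column_transforms={"email": mask_email},
)
print(list(visible))
```

Changing the policy here means changing the filter or the transform passed in, with no data to rewrite, which is the governance agility that on-read enforcement is meant to provide.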
Access at the enterprise level means many hundreds or thousands of users concurrently accessing petabytes of data every day. Therefore, any dynamic enforcement engine needs to be architected in a distributed and horizontally scalable way, working in a symbiotic relationship with the tools querying data. This is often where other dynamic access control solutions fall down. The enforcement engine becomes an inadvertent bottleneck, negating the performance and scale benefits of modern analytical tools (Spark, Presto, Snowflake, etc.) and reinforcing data consumers’ perception that access control slows them down.
Consistency of enforcement is often overlooked when evaluating solutions; open-source tools like Apache Ranger require duplicating the same policies for each engine, and not all policies apply consistently across these tools. For this reason, Okera’s secure data access platform supports different methods of enforcement, and transparently chooses the most performant method without ever compromising on security, consistency, or the end-user experience.
Okera selects which enforcement method to use based on a combination of factors such as the user, what dataset they are querying, and what type of policy needs to be enforced. For example, a data masking policy can be enforced natively inside Databricks through our connector, whereas there are other cases where the client is untrusted – like a multi-tenant EMR cluster, or a data scientist using Python on their laptop – where the policy enforcement would happen through Okera’s scalable, secure data plane.
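As a rough sketch of that selection logic, the Python below routes a request either to native enforcement in the client engine or to a separate data plane. The decision rule, the `QueryContext` fields, and the `NATIVE_SUPPORT` table are assumptions made for illustration; the factors Okera actually weighs and how it weighs them are not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class EnforcementMethod(Enum):
    NATIVE_CONNECTOR = "native_connector"  # enforced inside the client engine
    DATA_PLANE = "data_plane"              # enforced before data reaches the client


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical description of a request used to pick a method."""
    client: str           # e.g. "databricks", "emr", "python"
    client_trusted: bool  # can this client be relied on to enforce policy?
    policy_type: str      # e.g. "masking", "row_filter"


# Hypothetical table of (client, policy type) pairs the client can enforce itself.
NATIVE_SUPPORT = {("databricks", "masking")}


def choose_enforcement(ctx: QueryContext) -> EnforcementMethod:
    # Only push enforcement down to the client when it is trusted AND it
    # supports this policy type; otherwise fall back to the data plane so
    # an untrusted client never sees unprotected data.
    if ctx.client_trusted and (ctx.client, ctx.policy_type) in NATIVE_SUPPORT:
        return EnforcementMethod.NATIVE_CONNECTOR
    return EnforcementMethod.DATA_PLANE


print(choose_enforcement(QueryContext("databricks", True, "masking")))
print(choose_enforcement(QueryContext("python", False, "masking")))
```

The fallback direction matters: when in doubt, the sketch chooses the data plane, trading some performance for a guarantee that the policy is applied.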
An important note: Dynamic policy enforcement ultimately will add some additional computation in order to transform the data to apply the policy. There is no magic tool that can condense that down to zero. That being said, we’ve worked hard to ensure all our methods of enforcement are highly performant and scalable.
Benchmark testing shows that policy enforcement adds a negligible amount of latency overhead (less than 10% in the worst case), and has no effect on query throughput. In many cases, it can actually reduce the total query time, because policies like masking or row filtering reduce the amount of data being computed. At the end of the day, we believe the benefits of being able to open up access to sensitive data while maintaining privacy and governance far outweigh any negligible query overhead.
Okera’s Scalable Secure Data Plane
Okera’s “secret sauce,” if you will, is our scalable secure data plane. As mentioned above, there are cases where the client cannot enforce the access policy, and a secure data access layer is needed in order to guarantee security and consistency in policy enforcement. Okera’s scalable secure data plane allows us to streamline access and ensure consistent policy enforcement, truly living up to our promise of a single policy enforced across all tools.
We’ve spent a great deal of effort ensuring our data plane is extremely performant: it is architected to be fully distributed and horizontally scalable, and it can run at the scale of our enterprise customers’ analytical workloads. This has enabled us to work with customers with petabyte-scale data lakes, where we protect trillions of rows of data being read on a daily basis, across hundreds of nodes.
In order to ensure transparent scale and end user experience, Okera’s data plane is also able to be deployed on the same infrastructure as your existing analytics compute – what we call the “co-located” deployment mode – while still providing scalable and secure data access. This significantly reduces total cost of ownership to platform teams and enables more seamless chargeback to the lines of business.
3. How, where, and by whom are policies created?
We’ve talked a lot about the performance of the technologies that power enforcement: the policy decision point and the enforcement engine. However, an often-overlooked aspect of potential overhead is the underlying access model and user experience. Without the ability for non-technical users to manage policies, large enterprises will have a tough time scaling access control.
At Okera, we believe the ideal model to scale access control is something called “distributed stewardship.” This means delegating the necessary permissions that will empower data owners and data stewards to manage day-to-day access to their data, while still maintaining centralized oversight, auditing, and governance standards. Data access solutions should seek to embrace this model, or risk hindering the productivity of their end users.
We’ve designed an underlying access model that is least privilege by default and supports rich and flexible policy definition, but also has higher-level grouping constructs such as roles and attributes to make it easy to group your organization’s data into domains and delegate them to data stewards. We’ve heard from our customers that other access management tools require complex change management and bespoke processes and workflows in order to be effective. At Okera, we want to be simple enough that you can start seeing value from day one, but with an underlying access model that’s flexible enough to adapt and grow as your data platform evolves.
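One way to picture distributed stewardship is the small Python model below: a central admin delegates a data domain to a steward, the steward grants roles access within that domain only, nothing is readable until granted, and every change is recorded centrally. The `AccessModel` class and its method names are hypothetical and are not Okera’s access model or API.

```python
from dataclasses import dataclass, field


@dataclass
class AccessModel:
    """Hypothetical least-privilege model with delegated stewardship."""
    admins: set
    stewards: dict = field(default_factory=dict)   # domain -> set of stewards
    grants: set = field(default_factory=set)       # (role, domain) pairs
    audit_log: list = field(default_factory=list)  # central record of changes

    def delegate(self, admin: str, steward: str, domain: str) -> None:
        # Only a central admin may hand a domain to a steward.
        if admin not in self.admins:
            raise PermissionError(f"{admin} is not an admin")
        self.stewards.setdefault(domain, set()).add(steward)
        self.audit_log.append(("delegate", admin, steward, domain))

    def grant(self, steward: str, role: str, domain: str) -> None:
        # A steward can manage access only inside a domain delegated to them.
        if steward not in self.stewards.get(domain, set()):
            raise PermissionError(f"{steward} does not steward {domain}")
        self.grants.add((role, domain))
        self.audit_log.append(("grant", steward, role, domain))

    def can_read(self, role: str, domain: str) -> bool:
        # Least privilege by default: no grant means no access.
        return (role, domain) in self.grants


model = AccessModel(admins={"platform_admin"})
model.delegate("platform_admin", "dana", "marketing")
model.grant("dana", "analyst", "marketing")
print(model.can_read("analyst", "marketing"))
print(model.can_read("analyst", "hr"))
```

Day-to-day grants stay with the people closest to the data, while the shared audit log keeps oversight in one place.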
At Okera we hold the radical belief that not only does security not need to slow you down, but by building in security, governance, and access control as the foundation of the data platform you will actually make your organization more agile and productive.
Hopefully this has given you a good framework for thinking about performance when evaluating data access control for your organization. Our product vision is clear – to be a consistent, scalable, performant data access platform with a seamless user experience – and we believe we are the best at what we do.
If you have any questions, or would like to see a demo of our platform and scalable secure data plane, please contact us for more information.
If you found this topic interesting, Lars George (Principal Solutions Architect) just gave a webinar about How to Enforce Fine-grained Access Control on EMR without IAM Roles, which you can download here.