The modern enterprise data analytics platform is sophisticated, which means ensuring secure access to the data is complex. No matter where data is stored or which application is used to consume it, you must be able to guarantee that each user can always access exactly what they are permitted to access.
The complexity comes from data being stored in a variety of systems, accessed by a variety of applications, and powering a variety of business use cases. Each of these source storage systems and data consumption tools can vary tremendously in capabilities, yet secure data access must be solved uniformly across all of them. It’s also important to recognize that this complexity is not a negative – it is a reflection of how the data is being used in increasingly diverse use cases to provide value to the organization. This makes it critical to support secure data access.
Variety in the platform comes from these dimensions:
- The data itself – How big is it? How much sensitive information does it contain? What is the required access granularity to protect it? How many users currently use it? How well curated is the data and metadata associated with it?
- The system storing the data – What is the data model of the storage system (e.g. object store vs. document store vs. relational database)? How well does it support data transformations? What are its scaling, performance and cost characteristics? How does it handle metadata?
- The compute tools – Are they trusted? Do they run a client/server model? Are they multi-tenant or a SaaS product? How pluggable are they, and what is their data model? How do they scale and perform with high data volumes?
For example, many organizations have an analytics platform that includes (but is not limited to):
- Data scientists using Python and JupyterHub or other data science platforms (such as Amazon SageMaker/Domino Data Lab/Dataiku) to run training on data that resides both in object stores (S3/ADLS/GCS) and in RDBMS stores and data warehouses.
- Periodic (daily/weekly/monthly) batch jobs running across terabytes or petabytes of data in object stores using Apache Spark.
- Business analysts using BI tools such as Tableau and Looker for operational business dashboards, based on data in a variety of storage systems, using Presto as a compute engine to query across them.
- Exploratory users using infrastructure-less SQL querying services such as Amazon Athena and Google BigQuery to explore the data in the object store.
Being able to support this variety of usage patterns is very compelling to the business. This flexibility allows for using the most appropriate technology to solve each problem, and lets users choose the tools they’re most comfortable with.
However, existing solutions today typically provide access control at only one of the following levels of the stack:
1. Access control enforced between the user and the compute tool. This is appropriate when the compute tool is best treated like a black box. Enforcement is done by intercepting the request, authorizing it, and forwarding the request. An example of this is a BI application querying from a Query-as-a-Service platform such as Google BigQuery.
2. Access control enforced in the compute tool by leveraging its extensibility. This is appropriate when the compute tool is pluggable, has a compatible data model, and is able to do data transformations at scale. Examples of these tools are Apache Hive and Apache Spark.
3. Access control enforced in a separate data access service. This service reads data from the source and performs all the required data transformations before returning data to the consumer. This is appropriate when the storage system does not support the data model, the client is not trusted, or the required policy is not enforceable by the compute tool. An example of this is a Jupyter notebook directly accessing data in Amazon S3 when the policy requires some records to be filtered out or values deidentified.
4. Access control by configuring the storage system. In this case, the policy can be enforced natively in the storage system. For example, a policy that only requires all-or-nothing object access can be enforced natively in Amazon S3 using IAM, or a granular policy could be enforced for data in Postgres using its native access control capabilities.
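To make the third pattern more concrete, here is a minimal Python sketch of what a data access service does before returning data to an untrusted client: it filters out records and deidentifies values according to a policy. The `Policy` class, the `enforce` function and the sample rows are hypothetical names invented for this illustration, not part of any product API, and the truncated hash stands in for whatever deidentification a real policy would require.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Iterator

Record = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical policy: which rows a user may see and which columns to mask."""
    row_filter: Callable[[Record], bool]
    masked_columns: FrozenSet[str] = frozenset()


def deidentify(value: object) -> str:
    """Replace a value with a short, stable token.

    Illustrative only: an unsalted, truncated hash is not strong deidentification.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def enforce(records: Iterable[Record], policy: Policy) -> Iterator[Record]:
    """Apply the policy inside the service, before data reaches the client."""
    for record in records:
        if not policy.row_filter(record):
            continue  # the record is filtered out entirely
        yield {
            column: deidentify(value) if column in policy.masked_columns else value
            for column, value in record.items()
        }


# Rows as the service might read them from an object store (made-up data).
source_rows = [
    {"user_id": 1, "email": "a@example.com", "region": "EU"},
    {"user_id": 2, "email": "b@example.com", "region": "US"},
]

# Assumed policy: this user may only see US records, with emails masked.
us_only = Policy(
    row_filter=lambda r: r["region"] == "US",
    masked_columns=frozenset({"email"}),
)

visible = list(enforce(source_rows, us_only))
```

Because the transformation happens in the service, the notebook never receives the EU record or the raw email address, regardless of how the client itself is configured.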
As most solutions typically only implement a subset of these access patterns, enterprises are forced to make these tradeoffs:
- Do they want security across more tools at the cost of fitting their policies to the “lowest common denominator” of what these tools can do?
- Do they want to centralize all data access through one technology for consistency at the cost of performance or flexibility?
- Do they want to accept having multiple access paths that have differing levels of enforcement (“secure” and “less secure”) even when the same data is being accessed by the same user?
These tradeoffs can significantly limit the adoption and success of these platforms. They force the organization to either compromise on flexibility or risk exposure from relaxed security.
An enterprise solution for secure data access must have an architecture that can simultaneously support all of these access patterns and then, based on request context, decide the most appropriate one for each request. This will guarantee that consistent and secure policies are always applied in the most effective and appropriate way.
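As a rough illustration of deciding per request, the sketch below picks one of the four enforcement patterns from a small request context. The `RequestContext` fields, the `Enforcement` names and the order of the checks are assumptions made for this example; a real system would weigh many more signals, such as the policy itself, data volume and engine capabilities.

```python
from dataclasses import dataclass
from enum import Enum


class Enforcement(Enum):
    PROXY = "authorize and forward requests in front of the compute tool"
    IN_ENGINE = "enforce inside the compute tool via its extension points"
    ACCESS_SERVICE = "transform data in a separate data access service"
    NATIVE_STORAGE = "configure the storage system's own access controls"


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical summary of one data access request."""
    client_trusted: bool       # e.g. False for a notebook on a laptop
    engine_pluggable: bool     # e.g. True for Apache Spark or Apache Hive
    storage_can_enforce: bool  # storage can express the required policy natively


def choose_enforcement(ctx: RequestContext) -> Enforcement:
    """Pick an enforcement pattern; the precedence here is an assumption."""
    if ctx.storage_can_enforce:
        return Enforcement.NATIVE_STORAGE
    if not ctx.client_trusted:
        return Enforcement.ACCESS_SERVICE
    if ctx.engine_pluggable:
        return Enforcement.IN_ENGINE
    return Enforcement.PROXY  # treat the compute tool as a black box


notebook_on_s3 = RequestContext(
    client_trusted=False, engine_pluggable=False, storage_can_enforce=False
)
spark_batch = RequestContext(
    client_trusted=True, engine_pluggable=True, storage_can_enforce=False
)
bi_dashboard = RequestContext(
    client_trusted=True, engine_pluggable=False, storage_can_enforce=False
)
```

With these inputs, `choose_enforcement` selects the data access service for the notebook, in-engine enforcement for the Spark job, and the proxy pattern for the BI dashboard, while the policy being enforced stays the same in all three cases.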
Looking back to the earlier examples, a platform with this architecture can:
- For the data science user: Recognize that access is being requested from a data science tool such as JupyterHub that runs in an untrusted environment (e.g. on a user's laptop), and execute any data transformations and redactions prior to returning the data. In the case where the source is a data warehouse, ensure the policies are properly materialized directly in that source in order to take advantage of its query performance.
- For the big data batch jobs: Translate the desired data access policies on the fly into a form that Spark can execute natively, so that Spark can read directly from the storage system and apply its distributed compute capabilities.
- For the BI and exploratory querying: Authorize the queries and translate the desired data access policies into a rewritten query that the underlying engine (e.g. Presto) can execute.
These decisions are all made dynamically based on who is accessing the data, what data is being requested, the system they’re accessing it from, and the policy required. Across different requests, this architecture supports a seamless transition from one enforcement pattern to another.
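For the BI and exploratory case, translating a policy into a rewritten query can be sketched as follows. This is a deliberately simplified illustration: `TablePolicy`, `secured_view` and `rewrite` are hypothetical names, and the user's query carries a `{table}` placeholder so that no SQL parsing is needed. A real rewriter would parse the query and resolve table references instead of relying on a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TablePolicy:
    """Hypothetical per-user policy for one table."""
    columns: List[str]                                   # columns in the table
    masks: Dict[str, str] = field(default_factory=dict)  # column -> SQL expression
    row_filter: Optional[str] = None                     # SQL predicate, if any


def secured_view(table: str, policy: TablePolicy) -> str:
    """Build a subquery that exposes only what the policy allows."""
    select_list = ", ".join(
        f"{policy.masks[c]} AS {c}" if c in policy.masks else c
        for c in policy.columns
    )
    sql = f"SELECT {select_list} FROM {table}"
    if policy.row_filter:
        sql += f" WHERE {policy.row_filter}"
    return f"({sql}) AS {table}"


def rewrite(query_template: str, table: str, policy: TablePolicy) -> str:
    """Substitute the secured subquery for the {table} placeholder."""
    return query_template.format(table=secured_view(table, policy))


# Assumed policy: mask emails and restrict this user to US rows.
orders_policy = TablePolicy(
    columns=["order_id", "email", "region", "amount"],
    masks={"email": "'***'"},
    row_filter="region = 'US'",
)

rewritten = rewrite(
    "SELECT region, SUM(amount) FROM {table} GROUP BY region",
    "orders",
    orders_policy,
)
print(rewritten)
```

The engine then executes the rewritten query as ordinary SQL, so the filtering and masking run with the engine's own performance characteristics rather than in a separate hop.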
Looking forward, modern analytics platforms will only increase in sophistication, and more users will want to use the data for new purposes. At the same time, data access must be done responsibly and securely. The solutions that exist today are not up to the task and will fall further behind as the ecosystem expands. This is why a significant architectural evolution is needed.
The team here at Okera, with our combined years of experience building the tools and systems that have come to make up this new modern data analytics platform, anticipated the need for such an architecture. Over the last three years of working closely with our Fortune 500 customers and their evolving data access requirements, we’ve evolved our product and architecture to make sure we’re able to help data platform owners deliver secure data access to their enterprise.
If your organization is currently facing some of the problems listed above, we would love to hear from you and set up a time to discuss how Okera can help you solve them.
If you found this blog topic interesting and would like to learn more, Itay Neeman (VP of Engineering at Okera) is leading a webinar on March 26th called “Enabling Volume and Variety in the Data Lake: How to Secure Data Access at Scale” that goes into this topic in some more depth and specificity. Register now to save your spot.