Apache Ranger Guide for Access Control & Data Governance in the Cloud
Apache Ranger is an open-source project for providing data access control in a Hadoop ecosystem. It has been promoted by Hortonworks (now merged with Cloudera) as a complete solution for data governance and access control in the cloud.
Okera’s customers and prospects — most of whom have built or are in the early days of building data lakes on AWS S3 — frequently mention Ranger as a viable component for their technology stacks. A few have worked with it as part of their research and due diligence efforts.
It makes sense that, under the right conditions, Apache Ranger can be an effective component. At Okera, we ask our prospects whether they:
- Are implementing a data lake in the cloud on S3
- Need to consider access control for their use cases
- Need a governance model to support big data processing, analytics, and ML
If the answer to these questions is “Yes”, read on to understand more about how Ranger may or may not solve the problem of access control and governance in cloud data lakes.
Complex Architecture, Application Design, and Security
Cloud data lakes provide lines of business a broad platform for analytics and machine learning. The goal is to get insights from data that will inform business decisions and drive value for customers. The platform teams that support data lakes want to enable more adoption, which means more lines of business, more product and solution partners, auditors, and regulators. They all will need access to the same data, but in a form that suits their roles and responsibilities.
Platform teams therefore need an access control system to protect sensitive information, one that will support multiple kinds of workloads and access patterns without limiting data consumers to a prescribed set of tools. Accounting for the needs of all these roles in a dynamic manner makes it challenging to design and implement a security model that can scale without impeding adoption.
New users tend to look for access paths of least resistance. They’re not likely to assume the governance model has been designed to point them out. Impatient users may try to copy the data they need to save time. Others may want to rewrite pipeline code to a language they prefer, or rely on trusted tools (e.g., JDBC) to minimize their troubleshooting and learning curve. Others will stick to proprietary frameworks or tools that are a sunk cost they have to justify.
These forces influence the way an application platform evolves, and in particular how it is secured. Consider Hadoop. Hadoop famously co-locates its storage and compute services on each cluster node. All compute services (MapReduce, Hive, Spark, Impala, Presto, etc.) read data from HDFS.
Running a workload through Hadoop from an external client (an R or Python program, for example, or a BI interface such as Tableau) requires some additional drivers or configuration, in particular for security. Security administrators tend to want to minimize these access paths to reduce the cluster’s exposure to attack. These protections can add complexity for new use cases, reduce agility in governance, and slow adoption. This is where Ranger comes in.
Apache Ranger provides a comprehensive access control system for several Hadoop components, including HDFS, Hive, and others named below. Ranger is designed to plug in to the processes of each service it supports. Administrators can then apply authorization policies from a central console and bring log and audit information back to it.
Ranger is most widely used with HDP and is included in its distribution. It is promoted as a complementary service to Apache Atlas (which provides governance and metadata services), Apache Ambari (for UI-driven install and configuration), and Apache Solr (which supports search on Ranger’s audit logs).
Ranger gives a much-needed supplement to Hadoop’s default, open-arms access, but it also tailors its authorization model to each service’s data model. Ranger policies can address HDFS file permissions, Hive tables, HBase column families, and more. Each fit is straightforward, but it becomes problematic as soon as you ask whether security is uniform across all these services.
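To make the uniformity question concrete, here is a minimal Python sketch. The policy records are hypothetical and heavily simplified: the field names (`service`, `resource`, `group`, `access`), the paths, and the table names are illustrative assumptions, not Ranger's actual policy schema. The sketch shows only that one logical dataset ends up described by three differently shaped resources, so comparing policies across services is not a simple equality check.

```python
# Hypothetical, simplified policy records for one logical dataset ("sales").
# Field names and values are illustrative assumptions, not Ranger's schema.
POLICIES = [
    {"service": "hdfs", "resource": {"path": "/warehouse/sales"},
     "group": "analysts", "access": {"read"}},
    {"service": "hive", "resource": {"database": "default", "table": "sales"},
     "group": "analysts", "access": {"select"}},
    {"service": "hbase", "resource": {"table": "sales", "column_family": "totals"},
     "group": "analysts", "access": {"read"}},
]


def resource_shapes(policies):
    """Map each service to the sorted tuple of resource keys its policy uses."""
    return {p["service"]: tuple(sorted(p["resource"])) for p in policies}


def shapes_are_uniform(policies):
    """True only if every service describes its resource with the same keys."""
    return len(set(resource_shapes(policies).values())) == 1


if __name__ == "__main__":
    for service, shape in resource_shapes(POLICIES).items():
        print(service, shape)
    print("uniform:", shapes_are_uniform(POLICIES))
```

Because each service names its resources differently, any check for consistent protection has to translate between these shapes first. That translation is work the administrator takes on.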
Hive is an SQL-friendly interface, but it also hides the compute service it uses from the user. A client also could access the Hive Metastore through HiveServer2, or use its own table definitions on top of HDFS files it reads directly. An Apache Ranger administrator must either develop a system to maintain consistent policies for all three access paths, or limit support to what is needed: access for HiveServer2 only; access to HiveServer2 coordinated with HDFS (difficult without Ambari); or Hive CLI user access. The limits of this technology become, in effect, the limits of the data governance model.
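The coverage problem can be sketched in a few lines of Python. The path labels below (`hiveserver2`, `hdfs_direct`, `hive_cli`) are this sketch's own shorthand for the access paths discussed above, not identifiers from Ranger or Hive, and the function is a toy model rather than anything Ranger provides.

```python
# Labels for the access paths discussed above; the names are this sketch's own.
ACCESS_PATHS = ("hiveserver2", "hdfs_direct", "hive_cli")


def unguarded_paths(guarded):
    """Return the access paths that have no policy coverage, in a fixed order."""
    unknown = set(guarded) - set(ACCESS_PATHS)
    if unknown:
        raise ValueError(f"unknown access paths: {sorted(unknown)}")
    return [path for path in ACCESS_PATHS if path not in guarded]


if __name__ == "__main__":
    # An administrator who writes policies for HiveServer2 only
    # leaves the other two paths open or must disable them.
    print(unguarded_paths({"hiveserver2"}))
```

Every path returned by `unguarded_paths` must either receive its own consistent policy or be closed to users, which is the tradeoff described above.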
Cloud Platforms Change the Landscape for Application Architecture
In public clouds, storage and compute are discrete, uncoupled services. The separation of storage and compute is a paradigmatic change, and not one everybody sees at first blush. These storage services (AWS S3, Azure ADLS, and Google Cloud Storage) are highly scalable object stores that remove the operational complexity of HDFS from view. This simpler foundation brings the capability of a data lake into sharper focus for the enterprise.
Data lake architecture allows the enterprise to select best-of-breed compute and analytic services provided by any vendors or built on any framework. To realize that goal, however, it’s necessary to situate access control and governance between the data lake and its compute clients, something Ranger cannot do.
Our customers tell us that Ranger seems like an appealing option, but none have advanced with it beyond a proof-of-concept. There are a number of operational reasons why this is the case, but we think the final reason lies in the full value they want from a data lake. Let’s take a closer look at how cloud platform providers define it.
Data Lakes are a Capability, Not a Technology Stack
Here’s how AWS defines a data lake:
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboard and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
All the traditional benefits of Hadoop apply here — low-cost storage, schema-on-read analysis, no ingest requirements on raw data, openness to multiple forms of compute — but with a difference. The virtualization of distributed storage leaves clients free to apply the compute and the resource management service they prefer.
Ranger must plug in to a Hadoop service, which by definition is a storage or compute component. Ranger has no plugin for S3, which closes the door on a way to implement Ranger that applies to all compute services.
Applying Ranger at the compute level means applying it for all compute services or, as described above, limiting access to the compute paths that require coverage. In either case, the same conditions emerge, just in varying degrees:
First: Hadoop’s compute engines aren’t functionally consistent services. Combining Ranger with Hive is a popular choice because Hive encapsulates both service types below its query layer. Indeed, Ranger generally supports compute services that have a built-in data model. Spark is a leading choice among developers for Hadoop workloads, but there is no Ranger plugin support for it.
Second: plugging in to a compute service’s processes means authorization controls are enforced in user space. This is a suboptimal practice for security. As more compute services are secured this way, administrators must mind the potential for attacks particular to each one.
Third: adding security to the compute layer binds storage and compute together, even those separated by design. Hive does this by design to expose a query-only layer to the end user. This tradeoff benefits Hive clients, but at some cost to every other compute service that needs a metadata model but also more general, discrete access to the underlying services. And while Hive can use S3 as a storage service, Ranger cannot provide controls over, or even insight into, access requests to S3.
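The three conditions can be summarized with a toy Python model of where a request is checked. The set of components with a Ranger plugin reflects only what this article states (plugins for HDFS, Hive, and HBase; none for Spark or S3); the `Request` type and `enforcement_points` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Assumption: plugin coverage as described in this article only.
# HDFS, Hive, and HBase have Ranger plugins; Spark and S3 do not.
RANGER_PLUGINS = {"hdfs", "hive", "hbase"}


@dataclass(frozen=True)
class Request:
    """A data request that passes through one compute engine to one store."""
    engine: str
    storage: str


def enforcement_points(request):
    """List the components on the request path where a Ranger plugin can act."""
    return [c for c in (request.engine, request.storage) if c in RANGER_PLUGINS]


if __name__ == "__main__":
    for req in (Request("hive", "hdfs"), Request("hive", "s3"),
                Request("spark", "hdfs"), Request("spark", "s3")):
        print(req, "->", enforcement_points(req) or "no enforcement point")
```

In this model a Spark job reading from S3 passes no enforcement point at all, and a Hive query over S3 is checked only inside Hive, so the policy travels with the engine rather than with the data.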
Hadoop, nearly from its beginning, has had to adapt to the security needs and concerns of enterprise production. This work, albeit painful at times, brought enterprises closer to Hadoop’s promise of unprecedented power made possible by its distributed storage and general processing frameworks. Apache Ranger, a best-of-breed component for centralized, policy-based access control and governance enablement, plays a major role in keeping that promise.
Fully leveraging data lake architecture in the cloud, however, means opening a wider door. We can design access controls and data governance models that neither impose an application platform nor preclude one from getting to business data. Cloud-based data lakes are a capability designed to enable your technology stack, not the other way around. As more enterprises explore this option for their use cases, we’re confident they’ll want its full potential, both to inspire more users and keep pace with their demand.