What is Apache Ranger?
Apache Ranger is an open-source project that provides data access control in a Hadoop ecosystem. It is offered by Hortonworks (now merged with Cloudera) as a complete solution for data governance and access control in the cloud.
Okera’s customers and prospects — most of whom have built or are in the early days of building data lakes on Amazon S3 — frequently mention Ranger as a viable component for their technology stacks. A few have worked with it as part of their research and due diligence efforts.
These teams typically:
- Are implementing a data lake in the cloud on S3
- Need to consider access control for their use cases
- Need a governance model to support big data processing, analytics, and ML
Apache Ranger’s Access Control System
Cloud data lakes give lines of business a broad platform for analytics and machine learning. The goal is to get insights from data that will inform business decisions and drive value for customers. The platform teams that support data lakes want to enable more adoption, which means serving more lines of business, more product and solution partners, and more auditors and regulators. All of them need access to the same data, but in a form that suits their roles and responsibilities.
Apache Ranger provides a comprehensive access control system for several Hadoop components, including HDFS, Hive, and others named below. Ranger is designed to plug into the processes of each service it supports. Administrators can then apply authorization policies from a central console and collect Ranger audit information there.
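The plugin model described above can be pictured with a small sketch. This is an illustration only, not Ranger's actual API: the `Policy`, `CentralConsole`, and `ServicePlugin` names are hypothetical, and the plugin looks policies up directly in the console for simplicity. The point it tries to show is that each service enforces only the policies written for that service, while every decision is reported back to one place.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """One allow rule: a user may perform an action on a resource in a service."""
    service: str
    resource: str
    user: str
    action: str


@dataclass
class CentralConsole:
    """Hypothetical stand-in for the central console: holds policies and audit events."""
    policies: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def policies_for(self, service):
        # Only the policies written for the given service are relevant to its plugin.
        return [p for p in self.policies if p.service == service]


class ServicePlugin:
    """Hypothetical plugin that runs inside one service and checks each request."""

    def __init__(self, service, console):
        self.service = service
        self.console = console

    def is_allowed(self, user, action, resource):
        allowed = any(
            p.resource == resource and p.user == user and p.action == action
            for p in self.console.policies_for(self.service)
        )
        # Every decision, allowed or denied, is reported back to the console.
        self.console.audit_log.append((self.service, user, action, resource, allowed))
        return allowed


console = CentralConsole()
console.policies.append(Policy("hive", "sales.orders", "alice", "select"))

hive_plugin = ServicePlugin("hive", console)
hdfs_plugin = ServicePlugin("hdfs", console)

# The Hive policy grants access through Hive only; HDFS has no matching policy.
hive_allowed = hive_plugin.is_allowed("alice", "select", "sales.orders")
hdfs_allowed = hdfs_plugin.is_allowed("alice", "read", "/warehouse/sales/orders")
```

In this sketch, a policy defined for Hive says nothing about the same data reached through a different service. Each access path needs its own plugin and its own policies.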
Cloud Platforms Change the Landscape for Application Architecture
Data Lakes are a Capability, Not a Technology Stack
Ranger must plug into a Hadoop service, which by definition is a storage or compute component. There is no Apache Ranger plugin for S3, so there is no way to implement Ranger once at the storage layer and have it apply to all compute services.
First: Hadoop’s compute engines aren’t functionally consistent services. Using the Ranger Hive plugin is a popular choice because Hive encapsulates both service types below its query layer. Indeed, Ranger generally supports compute services that have a built-in data model. Spark is a leading choice among developers for Hadoop workloads, but there is no Ranger plugin support for it.