Understand your data usage

Audit and report on your data for regulatory, tracking, and monetization requirements.

Gain clear insight into data authorization and access patterns

A critical part of governance is visibility. The Okera Audit Engine lets you quickly determine what data any particular person can access and who has access to a particular data asset, so you can scale usage of your data lake with confidence that proper governance is always in place.

Rich, consistent visibility

Okera provides a consistent, highly detailed audit log for all access requests, across every tool, giving you visibility into data lake access. Each audit record captures the user context, the catalog object being accessed, the columns read within that object, and any row filters applied to the request. This level of detail and granularity can be used to answer many usage and compliance questions.
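As a rough illustration of what a single record carries, the sketch below shows the kinds of fields involved; the field names and values are hypothetical, not Okera's actual audit schema.

```python
# Illustrative shape of a single audit record. Field names and values are
# hypothetical: they mirror the categories described above, not the actual
# Okera audit schema.
example_audit_record = {
    "timestamp": "2021-04-02T14:31:07Z",         # when the request was made
    "user": "jdoe",                               # authenticated user context
    "client_application": "Apache Spark",         # tool that issued the request
    "catalog_object": "sales.transactions",       # dataset being accessed
    "columns_accessed": ["order_id", "amount"],   # columns read within the object
    "row_filter": "region = 'EMEA'",              # row filter applied to the request
    "records_returned": 120000,                   # how much data was read
}

if __name__ == "__main__":
    for field, value in example_audit_record.items():
        print(f"{field}: {value}")
```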

Answer data auditing questions

Detailed auditing helps you understand how data is being consumed and automatically detect properties of the data, such as whether it contains personally identifiable information (PII). Data lake owners can answer questions such as the following (a query sketch appears after the list):

  • Which users have accidentally accessed PII data?
  • How much of the data lake is properly cataloged?
  • Which tools are accessing the lake and how much did they read?
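The first of these questions, for example, comes down to joining audit events against the set of columns tagged as PII. The sketch below illustrates that join with an in-memory SQLite database; the table names, column names, and sample rows are made up for the example and are not Okera's audit schema.

```python
import sqlite3

# Self-contained illustration of the "which users accessed PII data?" question:
# join audit events against the set of columns tagged as PII. Table names,
# column names, and sample rows are made up for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audit_events (user TEXT, dataset TEXT, column_name TEXT);
    CREATE TABLE pii_columns  (dataset TEXT, column_name TEXT);

    INSERT INTO audit_events VALUES
        ('jdoe',   'sales.customers', 'email'),
        ('jdoe',   'sales.orders',    'order_id'),
        ('asmith', 'sales.customers', 'ssn');

    INSERT INTO pii_columns VALUES
        ('sales.customers', 'email'),
        ('sales.customers', 'ssn');
""")

query = """
    SELECT DISTINCT a.user, a.dataset, a.column_name
    FROM audit_events a
    JOIN pii_columns p
      ON a.dataset = p.dataset AND a.column_name = p.column_name
    ORDER BY a.user
"""
for user, dataset, column in conn.execute(query):
    print(f"{user} accessed PII column {dataset}.{column}")
```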

Out-of-the-box reports

ODAP provides out-of-the-box reports that are built into the Web UI. These answer common usage questions such as the following (the sketch after the list shows how similar figures can be derived directly from the audit data):

  • Number of queries over time
  • Number of queries by technology (e.g. Apache Hive vs. Apache Spark vs. Python)
  • Number of queries by dataset
  • Top users
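If you want these figures outside the Web UI, the same counts can be computed from the audit data itself. The snippet below is a minimal sketch using pandas with fabricated records; the field names are illustrative only.

```python
import pandas as pd

# Fabricated audit events; field names are illustrative, not Okera's schema.
events = pd.DataFrame([
    {"timestamp": "2021-04-01T09:00:00", "client_application": "Apache Hive",  "dataset": "sales.orders"},
    {"timestamp": "2021-04-01T14:30:00", "client_application": "Apache Spark", "dataset": "sales.orders"},
    {"timestamp": "2021-04-02T10:15:00", "client_application": "Python",       "dataset": "hr.employees"},
])
events["timestamp"] = pd.to_datetime(events["timestamp"])

# Number of queries over time (per day).
queries_per_day = events.set_index("timestamp").resample("D").size()

# Number of queries by technology and by dataset.
queries_by_tool = events.groupby("client_application").size()
queries_by_dataset = events.groupby("dataset").size()

print(queries_per_day, queries_by_tool, queries_by_dataset, sep="\n\n")
```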

Custom analytics

Okera exposes audit activity as datasets that can be queried and analyzed with any analytics tool. You can use this data to model user and dataset access behavior over time, integrate it with Microsoft Active Directory to understand usage across business units, or join it with catalog metadata such as tags.
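As one sketch of that kind of analysis, the example below joins fabricated audit events with a hypothetical user-to-business-unit mapping (standing in for group information exported from Active Directory) to summarize how much each business unit reads from each dataset.

```python
import pandas as pd

# Fabricated audit events plus a hypothetical user-to-business-unit mapping,
# standing in for group information exported from Active Directory.
events = pd.DataFrame([
    {"user": "jdoe",   "dataset": "sales.orders", "records_returned": 120_000},
    {"user": "asmith", "dataset": "hr.employees", "records_returned": 4_500},
    {"user": "jdoe",   "dataset": "hr.employees", "records_returned": 300},
])
business_units = pd.DataFrame([
    {"user": "jdoe",   "business_unit": "Sales"},
    {"user": "asmith", "business_unit": "Human Resources"},
])

# Join audit activity with organizational context, then summarize which
# business units read which datasets, and how much.
usage = events.merge(business_units, on="user")
summary = usage.groupby(["business_unit", "dataset"])["records_returned"].sum()
print(summary)
```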

Integration with log management tools

The audit data is persisted as JSON log files, which can be integrated with any existing log management solution, such as Splunk, or with custom processing engines such as AWS Lambda. This can be used to build real-time dashboards or to feed the audit trail into existing workflows.
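A minimal sketch of the Lambda pattern is shown below, assuming the audit files land in S3 and each line in a file is one JSON record; the S3 trigger, field names, and the alerting action are placeholders to adapt to your environment.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Datasets whose access should trigger an action; placeholder values.
WATCHED_DATASETS = {"hr.employees", "finance.payroll"}

def handler(event, context):
    """Sketch of an AWS Lambda handler triggered when a new audit log file
    lands in S3. Assumes one JSON audit record per line; field names and the
    alerting action below are placeholders."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            if not line.strip():
                continue
            audit_event = json.loads(line)
            if audit_event.get("dataset") in WATCHED_DATASETS:
                # Placeholder action: in practice, push to a dashboard,
                # forward to Splunk, or open a ticket.
                print(f"Sensitive access: {audit_event.get('user')} "
                      f"read {audit_event.get('dataset')}")
```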