When data specialists talk about data access, they are usually referring to structured data, i.e., databases, tables, columns, and so on. This assumption is not only common but logical, since these same people prefer to store the data they collect in a way that makes it easy to query. This is the very definition of structured data: data that is organized in some tabular structure. Consequently, the tools for access control management are optimized for this type of data: they let you grant a permission at some level, such as a database, table, or column, and enforce it.

 

Unstructured data, on the other hand, is data that can’t be mapped into such a structure. Good examples of unstructured data include a set of pictures (like medical x-ray images) or a set of audio/video files. Both of these lack commonality in how they are structured and sized, and therefore can’t really be queried in a reasonable fashion. That said, unstructured data is actually very common; it’s just inconvenient to use with standard industry data querying tools.

https://venturebeat.com/data-infrastructure/report-80-of-global-datasphere-will-be-unstructured-by-2025/

 

To help manage unstructured data, we introduced OkeraEnsemble.

OkeraEnsemble extends advanced access control capabilities to unstructured data, allowing our customers to set up permissions on their files and directories. It works somewhat similarly to permissions on structured data, but the “structure” part is provided by the file system where the files are stored.

This approach works well until groups of files need different access semantics. For example, say we have a directory path like this:

 

`s3://images/hospitals/seattle-grace-hospital/`

 

and it contains subdirectories such as `x-rays`, `mris`, and `document_scans`. Here we might want to let roles access only certain subdirectories (say, only `x-rays` and `mris` of those three). One way to achieve that would be to create a permission for each subdirectory, but at scale that leads to an explosion of permissions.
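To make the scale problem concrete, here is a minimal Python sketch that counts the path-based grants needed when every allowed subdirectory gets its own permission. It is a toy model, not OkeraEnsemble code: the extra hospital names and the role names are hypothetical, and only the `seattle-grace-hospital` path and the three subdirectory names come from the example above.

```python
from itertools import product

# Only "seattle-grace-hospital" comes from the example; the others are hypothetical.
hospitals = ["seattle-grace-hospital", "hospital-b", "hospital-c"]
subdirs = ["x-rays", "mris", "document_scans"]
allowed_subdirs = {"x-rays", "mris"}
roles = ["radiologist", "researcher"]  # hypothetical role names

# One grant per (role, hospital, allowed subdirectory) combination.
grants = [
    (role, f"s3://images/hospitals/{hospital}/{subdir}/")
    for role, hospital, subdir in product(roles, hospitals, subdirs)
    if subdir in allowed_subdirs
]

print(len(grants))  # roles x hospitals x allowed subdirectories
```

The count is the product of roles, hospitals, and allowed subdirectories, so adding a hospital or a role adds a whole batch of new grants to maintain.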

 

To address the permission explosion problem, OkeraEnsemble brings ABAC to URI permissions. ABAC stands for Attribute-Based Access Control and allows object attributes to be used in permission definitions. With ABAC, we can assign attributes to directories and files (also called tagging) and then reference these tags in permissions. Using tags lets us define human-understandable permissions and drastically reduces the number of permissions we need to manage.

Another benefit of the ABAC approach is that it lets us control access to data dynamically by changing tag assignments. With a tag-based policy in place, we can remove access to any file or directory within the policy's path by removing the referenced tag from that file or directory.
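The following Python sketch illustrates the idea of a tag-based permission and of revoking access by removing a tag. It is a toy model under stated assumptions, not OkeraEnsemble's implementation or API: the `TagPermission` class, the `can_access` and `effective_tags` helpers, the `medical_imaging` tag, the `radiologist` role, and the file names are all hypothetical, and the sketch assumes that a tag on a directory applies to everything beneath it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagPermission:
    """Hypothetical model: grants `role` access to URIs under `prefix` that carry `tag`."""
    role: str
    prefix: str
    tag: str


def effective_tags(uri, tags):
    """Tags on the URI itself plus tags on any directory containing it (assumed inheritance)."""
    found = set()
    for tagged_uri, assigned in tags.items():
        is_parent_dir = tagged_uri.endswith("/") and uri.startswith(tagged_uri)
        if uri == tagged_uri or is_parent_dir:
            found |= assigned
    return found


def can_access(role, uri, permissions, tags):
    """True if any permission for this role covers the URI and references one of its tags."""
    uri_tags = effective_tags(uri, tags)
    return any(
        p.role == role and uri.startswith(p.prefix) and p.tag in uri_tags
        for p in permissions
    )


base = "s3://images/hospitals/seattle-grace-hospital/"

# Tag assignments: only the imaging subdirectories carry the (hypothetical) tag.
tags = {
    base + "x-rays/": {"medical_imaging"},
    base + "mris/": {"medical_imaging"},
}

# A single permission covers every tagged directory under the prefix.
permissions = [TagPermission("radiologist", "s3://images/hospitals/", "medical_imaging")]

xray = base + "x-rays/scan-001.png"          # hypothetical file names
scan = base + "document_scans/form-17.pdf"

before = (
    can_access("radiologist", xray, permissions, tags),
    can_access("radiologist", scan, permissions, tags),
)

# Revoke access to x-rays by removing the tag; the permission itself is unchanged.
tags[base + "x-rays/"].discard("medical_imaging")
after = can_access("radiologist", xray, permissions, tags)

print(before, after)
```

In this model one permission replaces a grant per subdirectory, and changing who can read what becomes a matter of editing tag assignments rather than permissions.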

 

One important caveat: this approach shifts complexity toward tag management, so assigning the right tags to the right data becomes crucial. On the other hand, tag management is a necessary part of overall metadata management anyway, and this way we can harness that effort for permissions, greatly simplifying them in the process.

 

To support these new workflows, we are extending our DDL set with new commands.

In early 2023, we will also be adding a new UI page for viewing and managing URIs, with richer interactions such as tagging, creating permissions, and advanced filtering.

When you navigate to one of these URIs from the list, you can see more detailed information about it, such as all the permissions that allow access to it and the date and time it was last updated.

Together, these provide the building blocks for both visual and automatable (via DDL) management of unstructured data.