Why do we need Database Schema Design?
Data organization is a vital performance optimization technique proven by use in legacy data warehouse architectures, where it is common that data is stored in accordance to its use case. For example, transactional data is stored in a row based format, where each row represents a transaction record. In contrast, analytical data is commonly stored in columnar form; organized by facts and dimensions across many separate tables.
One Schema Format and One Registry
A Simple Approach to Database Schema Design
TABLEobject with each attribute matching the original schema. A developer or analyst could then use the resulting table to feed dashboards or be a part of complex queries using the SQL
JOINoperator as they would with any other SQL table. For the data owners, this approach defines a one-to-one mapping from the data source into the Okera Platform, thus allowing data consumers to immediately begin using their data.
The Alternative: Proxy VIEWs
VIEW. That way, the one-to-one table structure is disconnected from the users, so that low-level changes are only reflected in the
VIEWdefinition. All downstream consumers have to use the proxying view instead of the table, putting them out of harms way (that is, shield them from certain structural changes). Using SQL
GRANTstatements, the DBA can revoke access to the table and allow access to the view instead.
VIEWcould be defined with a new name to represent the way forward. Those who need to get access to the latest schema can migrate to the new view as necessary.
Reference Schema Design
GROUP BYoperators in order to deliver the most efficient administrative experience possible. What we recommend is adding another layer in the schema design architecture to account for these access controls measures, that is, the access control views.
SELECTstatement, combined with a
WHEREclause that filters what rows are returned. In addition, the DBA can use SQL functions to mask or tokenize column values based on the user running the query.
Okera supports this by allowing DBAs to define special
VIEWs that are registered with the Schema Registry, but are not evaluated in Okera. For example, consider this
VIEW definition (note that, for the sake of the example, the
sales_transactions dataset is assumed to be an access control view):
CREATE VIEW sales_transactions_cal AS SELECT <fields> ... FROM sales_transactions WHERE region = 'california'
SELECT * FROM sales_transactions_cal
TABLEdefinition (that is, it returns the fields in the final schema) when requested by the compute engine. This can be seen when using the
SHOW CREATE TABLEcommand from the Hive CLI (shown abbreviated):
hive> SHOW CREATE TABLE sales_transactions_cal; OK CREATE EXTERNAL TABLE sales_transactions_cal( txnid bigint COMMENT '', dt_time string COMMENT '', sku string COMMENT '', userid int COMMENT '', price float COMMENT '', creditcard string COMMENT '', ip string COMMENT '') ...
Okera supports the EXTERNAL keyword to define a view that is different from Okera’s default behavior. For example:
CREATE EXTERNAL VIEW revenue_cal AS SELECT min(revenue) as minRevenue, max(revenue) as maxRevenue FROM sales_transactions WHERE region = 'california'
EXTERNALkeyword changes how the Okera Data Access Service (ODAS) Planner returns the
VIEWdefinition when requested by the compute engine, returning the full
hive> SHOW CREATE TABLE revenue_cal; OK CREATE VIEW revenue_cal AS SELECT * FROM salesdb.sales_transactions WHERE region = 'california'
VIEWoccurs outside of Okera Platform but inside the downstream compute engine. All of the unsupported SQL is handled downstream as well, while the
sales_transactionaccess control view is still handled by Okera; applying all the filtering and audit logging as expected.
Impact and Outlook
On top of that, the layered view approach in the Okera Platform provides DBAs with additional data management capabilities. For example, it allows DBAs to limit the amount of data returned for a partitioned table when no matching
WHERE clauses was defined. The special Okera-provided
LAST PARTITION feature can avoid unnecessary full table scans (such as
SELECT * FROM to check what data a dataset holds, just to realize it has terabytes or petabytes of data) that may inadvertently choke up valuable system resources.