Importance of Database Schema Design
Database schema design is an important part of achieving successful data organization and optimal data management. Data organization is a vital performance optimization technique proven by use in legacy data warehouse architectures, where it is common that data is stored in accordance to its use case. For example, transactional data is stored in a row based format, where each row represents a transaction record. In contrast, analytical data is commonly stored in columnar form; organized by facts and dimensions across many separate tables.
One Schema Format and One Registry
A Simple Approach to Database Schema Design
TABLE
object with each attribute matching the original schema. A developer or analyst could then use the resulting table to feed dashboards or be a part of complex queries using the SQL JOIN
operator as they would with any other SQL table. For the data owners, this approach defines a one-to-one mapping from the data source into the Okera Platform, thus allowing data consumers to immediately begin using their data.The Alternative: Proxy VIEWs
VIEW
. That way, the one-to-one table structure is disconnected from the users, so that low-level changes are only reflected in the VIEW
definition. All downstream consumers have to use the proxying view instead of the table, putting them out of harms way (that is, shield them from certain structural changes). Using SQL GRANT
statements, the DBA can revoke access to the table and allow access to the view instead.VIEW
could be defined with a new name to represent the way forward. Those who need to get access to the latest schema can migrate to the new view as necessary.Reference Schema Design
DISTINCT
or GROUP BY
operators in order to deliver the most efficient administrative experience possible. What we recommend is adding another layer in the schema design architecture to account for these access controls measures, that is, the access control views.SELECT
statement, combined with a WHERE
clause that filters what rows are returned. In addition, the DBA can use SQL functions to mask or tokenize column values based on the user running the query.Compute Engines
Okera supports this by allowing DBAs to define special VIEW
s that are registered with the Schema Registry, but are not evaluated in Okera. For example, consider this VIEW
definition (note that, for the sake of the example, the sales_transactions
dataset is assumed to be an access control view):
CREATE VIEW sales_transactions_cal
AS SELECT <fields> ...
FROM sales_transactions
WHERE region = 'california'
SELECT * FROM sales_transactions_cal
VIEW
as a TABLE
definition (that is, it returns the fields in the final schema) when requested by the compute engine. This can be seen when using the SHOW CREATE TABLE
command from the Hive CLI (shown abbreviated):hive> SHOW CREATE TABLE sales_transactions_cal;
OK
CREATE EXTERNAL TABLE sales_transactions_cal(
txnid bigint COMMENT '',
dt_time string COMMENT '',
sku string COMMENT '',
userid int COMMENT '',
price float COMMENT '',
creditcard string COMMENT '',
ip string COMMENT '')
...
Okera supports the EXTERNAL keyword to define a view that is different from Okera’s default behavior. For example:
CREATE EXTERNAL VIEW revenue_cal
AS SELECT min(revenue) as minRevenue, max(revenue) as maxRevenue
FROM sales_transactions
WHERE region = 'california'
EXTERNAL
keyword changes how the Okera Data Access Service (ODAS) Planner returns the VIEW
definition when requested by the compute engine, returning the full VIEW
definition instead:hive> SHOW CREATE TABLE revenue_cal;
OK
CREATE VIEW revenue_cal AS SELECT * FROM salesdb.sales_transactions
WHERE region = 'california'
VIEW
occurs outside of Okera Platform but inside the downstream compute engine. All of the unsupported SQL is handled downstream as well, while the sales_transaction
access control view is still handled by Okera; applying all the filtering and audit logging as expected.Impact and Outlook
On top of that, the layered view approach in the Okera Platform provides DBAs with additional data management capabilities. For example, it allows DBAs to limit the amount of data returned for a partitioned table when no matching WHERE
clauses was defined. The special Okera-provided LAST PARTITION
feature can avoid unnecessary full table scans (such as SELECT * FROM
to check what data a dataset holds, just to realize it has terabytes or petabytes of data) that may inadvertently choke up valuable system resources.
Okera recommends these guidelines to provide for a future-proof, stress free management of datasets – something administrators and users will certainly appreciate.