I recently had the opportunity to participate in eWeek’s “Down the Batch: Trends in Data Orchestration,” the 95th edition of #eWEEKCHAT. The discussion focused on data orchestration and was moderated by eWeek’s Editor in Chief, Chris Preimesberger, who spoke with a panel of experts including:
- Adit Madan – Product & Product Marketing Manager at Alluxio, which provides a cloud data orchestration platform.
- Daniel Graves – CTO of Delphix, which provides a programmable data infrastructure and DataOps platform.
- David Wilmer – Principal Technical Product Marketing Manager at Talend, which provides a cloud data integration and data integrity platform.
Data orchestration is the end-to-end automation of data-driven processes, often spanning many different systems, departments, and types of data, throughout the entire data lifecycle. As the conversation below suggests, data orchestration is becoming more and more important as organizations manage ever-growing amounts of increasingly distributed data.
Q: To what extent is your company using data orchestration processes?
Me – We help our users make the data in their platform more accessible, and the data orchestration processes are a critical requirement and capability of their platforms. A lot of the value all of us are getting from our data investments comes from being able to operationalize it and to continuously make decisions on it. We need to do this as more data, applications and use cases emerge. Our focus at Okera is to make it easy for applications and data flows to access data with proper controls. We’ve seen that getting the access capabilities right really speeds up all the data processes.
Daniel – I see quite a few different business objectives and outcomes driving investments in data orchestration. Digital transformation is one, with companies bringing new online capabilities to market and improving customer experiences. These new digital scenarios require changes in the way data is served up. One, the data sources range across generations – from mainframe to relational to NoSQL and cloud. Two, the data could be needed in any location or cloud. Three, it has to be used together, which requires consistent transformation across the entire data landscape. Take the insurance industry, for example. Business drivers are focused on customer acquisition and customer retention. This drives an application initiative to improve the claims processing experience, which requires data throughout the dev, test, and analytics lifecycles.
David – Talend is an orchestration enabler. Talend provides all the necessary components for proper data orchestration, from scheduling and monitoring, to error handling and reporting. We are also seeing the importance of data orchestration as microservices become more prevalent. Microservices get organizations more modularized. One microservice going offline doesn’t shut down the whole system. In proper orchestration, a process can be re-routed until the down system is back online.
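To make David's point about re-routing concrete, here is a minimal Python sketch of the idea. It is not how Talend or any other product implements it; the names `route_with_failover`, `primary_service`, `standby_queue`, and `TargetUnavailable` are hypothetical and exist only for illustration. The assumption is that each target is a callable that raises an exception when it is offline.

```python
from typing import Callable, Sequence


class TargetUnavailable(Exception):
    """Raised by a (hypothetical) target when it cannot accept work right now."""


def route_with_failover(record: dict, targets: Sequence[Callable[[dict], str]]) -> str:
    """Try each target in priority order and return the first successful result."""
    errors = []
    for target in targets:
        try:
            return target(record)
        except TargetUnavailable as exc:
            errors.append(str(exc))  # remember the failure, then try the next target
    raise RuntimeError(f"all {len(targets)} targets failed: {errors}")


def primary_service(record: dict) -> str:
    # Simulate a microservice that is currently down.
    raise TargetUnavailable("primary is offline")


def standby_queue(record: dict) -> str:
    # Simulate a fallback that holds the record until the primary returns.
    return f"queued:{record['id']}"


result = route_with_failover({"id": 7}, [primary_service, standby_queue])
print(result)  # prints "queued:7"
```

The record still gets handled even though the first target is down, which is the behavior David describes: one failing service does not stop the whole flow.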
Q: What key advantages do real-time or near-real time data orchestration have?
Me – Real-time or near-real-time orchestration provides huge benefits. Aside from the obvious benefit of having the latest data to work with, having this capability means you’ve operationalized the process and made it efficient for your organization. This means you’ll have an easier time building new applications and flows, and you’ll be able to address problems such as lineage, data quality, and handling failures. If you’re only planning to do this task infrequently, it’s going to be a challenge to make it robust. If we take some lessons from CI/CD, implementing repeatable and near-real-time data orchestration processes is the right approach.
Adit – Delays in data availability essentially mean delays in insights. For customers, this accuracy and velocity directly translate to costs. For example, a bank uses data orchestration to determine when to refill ATMs, and near-real-time orchestration prevents hefty fines by providing timely insights.
David – The obvious advantage is minimizing downtime. Smart orchestration can self-correct or at least redirect, so one down process doesn’t shut down the whole system. Future state: bring AI into the mix and orchestration could potentially predict failures. AI is trending across all of IT. Some orchestration solutions are already incorporating machine learning into their orchestration. AI would be the next logical step.
Daniel – One of the key aspects of real-time data, other than timeliness, is the notion of data on-demand versus using predetermined data. Data warehouses and cloud stores house a predetermined set of data, but the data needed for a given analysis might be elsewhere. And while companies need real-time/current data, they also need fine-grained data history for comparative analytics and model training.
Q: To what extent does data orchestration interact with and/or complement network security?
Me – Data orchestration and data security need to be designed hand-in-hand. Data orchestration helps us get data into different systems to do what they do best, but those systems may not have the same security requirements. The systems can sit in different parts of the network, accessible to different users and with different security capabilities. It’s obviously critical to make sure the data is secure, no matter where it ends up. One of the patterns we’ve seen again and again is that whatever the security management solution is, it needs to support dynamic access control. It’s impossible to predict what kind of data will end up where, and that naturally evolves over time. Similarly, it’s not great to restrict users from end systems. What we’ve seen work really well is to think about the data security aspects as a fundamental requirement for data orchestration systems and plan for them from the start!
Adit – Data orchestration is the brain that decides where data lives, for how long, in what representation, and in how many copies. The implementation of each of these decisions often integrates with orthogonal technologies. I view network security as one such essential complementary tech.
Daniel – Data orchestration and data security have critical intersections. Regulations such as GDPR and CCPA require governance around sensitive data for movement, sharing, access and de-identification. For network security, there are network zones for original or anonymized data. This creates the need for data orchestration technologies to be aware of the nature of the data they are orchestrating. Is it sensitive? Is it subject to an infosec policy or regulation? The deluge of new or expanding data privacy regulations is daunting for organizations. Following GDPR, dozens of other countries have introduced similar regulations, and the same is happening in the U.S. at the state level.
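Daniel's questions ("Is it sensitive? Is it subject to a policy?") can be sketched as a check the orchestrator runs before moving anything. The Python below is only an illustration under stated assumptions: the `SENSITIVE_FIELDS` set, the two zone names, and the `Dataset`, `is_sensitive`, and `plan_move` names are all hypothetical and do not come from any regulation or product.

```python
from dataclasses import dataclass

# Assumed policy: columns that make a dataset sensitive (hypothetical list).
SENSITIVE_FIELDS = frozenset({"name", "email", "ssn"})


@dataclass(frozen=True)
class Dataset:
    name: str
    columns: frozenset
    anonymized: bool = False


def is_sensitive(ds: Dataset) -> bool:
    """A dataset is sensitive if it still carries any policy-listed column in the clear."""
    return not ds.anonymized and bool(ds.columns & SENSITIVE_FIELDS)


def plan_move(ds: Dataset, destination_zone: str) -> str:
    """Decide how to move a dataset into a network zone ("restricted" or "general")."""
    if is_sensitive(ds) and destination_zone != "restricted":
        return "mask-then-move"  # de-identify before the data leaves the restricted zone
    return "move"


claims = Dataset("claims", frozenset({"claim_id", "email", "amount"}))
print(plan_move(claims, "general"))     # prints "mask-then-move"
print(plan_move(claims, "restricted"))  # prints "move"
```

The point of the sketch is that the decision depends on the nature of the data and the destination together, so the orchestration layer needs both pieces of metadata available at the time it plans a move.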
Q: Where does the concept of observability figure into the data orchestration process?
Me – Observability is probably the key challenge for data orchestration. At this point, it’s not that it’s hard to get something built one time, on static data. We all know how to do that. But getting this to always work in a dynamic environment can be incredibly hard. These kinds of data issues can be very time consuming to track down, and the worst part is errors can often go undetected. You may not know there is an issue with your data orchestration process. Observability is key to getting this right. More and better metrics as early as possible.
David – Observability has to be a major part, just like security, automation, data quality, etc. They are all pieces to a larger puzzle. The question is, who is on the other end of the observability capabilities? If no one is watching or being notified, orchestration has failed.
Daniel – Data observability is increasingly important to many different data orchestration use cases, including DevOps, Analytics, and SRE. Take SRE for example, with the goal of improving application uptime. Observing data errors and surgically repairing data is critical to quality.
Adit – Agreed, observability is key. Policies often determine the movement of data across storage systems. Replication to prevent data loss is based on failure detection. And similarly, detection of changes across data centers with associated sync is based on observability.
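As a small illustration of "more and better metrics as early as possible," and of David's point that someone has to be notified, here is a hedged Python sketch. The `run_step` wrapper, the `min_ratio` threshold, and the `drop_missing_amount` step are hypothetical names chosen for this example; a real pipeline would send metrics and alerts to whatever monitoring system it already uses.

```python
from typing import Callable, Iterable, List


def run_step(
    name: str,
    step: Callable[[List[dict]], List[dict]],
    rows: Iterable[dict],
    notify: Callable[[str], None],
    min_ratio: float = 0.9,
) -> dict:
    """Run one pipeline step, record row counts, and notify if too many rows vanish."""
    rows_in = list(rows)
    rows_out = step(rows_in)
    metrics = {"step": name, "rows_in": len(rows_in), "rows_out": len(rows_out)}
    # A silent drop in row count is a common way for data errors to go undetected.
    if rows_in and len(rows_out) / len(rows_in) < min_ratio:
        notify(f"{name}: kept {len(rows_out)} of {len(rows_in)} rows")
    return metrics


def drop_missing_amount(rows: List[dict]) -> List[dict]:
    # Example step: discard records that have no amount.
    return [r for r in rows if r.get("amount") is not None]


alerts: List[str] = []
data = [{"amount": 10}, {"amount": None}, {"amount": 5}, {}]
metrics = run_step("clean", drop_missing_amount, data, alerts.append)
print(metrics)  # {'step': 'clean', 'rows_in': 4, 'rows_out': 2}
print(alerts)   # ['clean: kept 2 of 4 rows']
```

Here the step itself "succeeds," but the wrapper surfaces that half the rows disappeared, which is exactly the kind of issue that otherwise goes unnoticed.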
Q: What should data orchestration be doing, or doing better, than it currently does?
Me – Data orchestration should deepen the integration with the enterprise data platform. It can help with many parts of the data lifecycle and help connect them together. This includes security, cataloging and other components. Aside from the data itself, the metadata integration from the data orchestration processes can be very powerful.
Daniel – Data orchestration applies to many use cases: AI/ML, DevOps, SRE, Cloud Adoption, etc. One important improvement is better integration between data orchestration technologies and the toolchains that need the data. Related, a key improvement is to make all data orchestration available via APIs. Then the data can be automated to the same degree as infrastructure and the SDLC. Data as Code. With Data as Code, organizations achieve higher velocities for all their strategic data-dependent initiatives.
David – I think it starts with the continued maturing of DataOps and the advancement of ML and AI into the orchestration process. And of course leverage (Talend’s) data quality capabilities available through the integration processes. However, I would argue that more important than your orchestration tools would be your integration and data quality tools! (Shameless plug). You can automate and orchestrate the crap out of your processes, but if the data is poor quality, it’s still going to be garbage in, garbage out – just like MDM systems from 15 years ago.
Q: How important is Kubernetes in these new orchestration models?
Me – Kubernetes is definitely important and provides many of the key capabilities data orchestration needs. We want to make it possible to separate DevOps and data engineering. Kubernetes helps us do this in a flexible way.
Daniel – Kubernetes is a key part of the modern tech stack in many organizations, so being able to orchestrate data into, out of, and adjacent to these containers is necessary.
Adit – Kubernetes is emerging as the platform of choice for orchestrating compute across cloud and on-prem environments. Each such environment, then, should use data orchestration technologies like Alluxio to consume data from myriad sources. Being able to host emerging compute technologies, whether AI or analytics, is a major attraction for Kubernetes. Similarly, data orchestration must support various APIs for storage access by these compute technologies.
David – Kubernetes is very important in the evolution of data orchestration as organizations make their move to cloud, containerization and platform agnosticism. Talend saw the value of Kubernetes, the leader in container orchestration, very early on, and now, with Spark on Kubernetes, we are experiencing the next evolution in advanced analytics.
Thanks again to Chris for including me in the chat and to the other participants who made it a lively and thought-provoking conversation! You can check out the full chat on data orchestration on CrowdChat here.
Interested in learning more about the participants? Follow them on Twitter: