A brief history of data systems
Relational database management systems (RDBMS) have been around for over 50 years. They have been at the heart of enterprises, supporting everything from ledger systems to web-based applications. These systems were designed and optimized for business data processing and are commonly referred to as online transactional processing (OLTP) systems because they support the day-to-day operations of a company.
OLTP workloads need databases that can handle a high volume of transactions and ensure the Atomicity, Consistency, Isolation, and Durability of data, meaning they are ACID-compliant. But in addition to the transactional support needed to run a business, organizations over time realized that they needed deeper insights into their overall business performance.
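The atomicity guarantee above can be sketched with a small, hypothetical ledger (the table and account names are illustrative, and SQLite stands in for any ACID-compliant OLTP database): either every update in a transaction applies, or none do.

```python
import sqlite3

# Hypothetical two-account ledger used to illustrate atomicity,
# one of the ACID guarantees an OLTP database provides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds between accounts; either both updates apply or neither."""
    try:
        # `with conn` opens a transaction: commit on success, rollback on error.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the rollback has already restored both balances

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # fails mid-transaction: balances unchanged
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 70, 'bob': 80}
```

The second transfer debits alice before the balance check fails, yet her balance is unchanged afterward: the partial update was rolled back as a unit.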
To do this, organizations needed to bring together data from multiple sources into a centralized location, so that operational data could be aggregated, enriched, correlated, and analyzed to produce deep insights on business performance and trends. These online analytical processing (OLAP) workloads need specialized optimization of data stores to handle complex joins across large datasets, something that is outside the normal workload for an OLTP system. Thus the idea of a centralized data warehouse was born.
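A minimal sketch of such an analytical workload, with made-up fact and dimension tables (again using SQLite purely for illustration): operational order data is joined against reference data and aggregated into a business-level metric.

```python
import sqlite3

# Hypothetical warehouse-style query: join a fact table (orders)
# to a dimension table (stores) and aggregate -- the kind of
# workload OLAP systems are optimized for.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, store_id INTEGER, amount REAL);
CREATE TABLE stores (store_id INTEGER, region TEXT);
INSERT INTO orders VALUES (1, 10, 120.0), (2, 10, 80.0), (3, 20, 200.0);
INSERT INTO stores VALUES (10, 'EMEA'), (20, 'APAC');
""")

# Correlate operational data with reference data to produce an
# aggregated insight: revenue per region.
rows = conn.execute("""
    SELECT s.region, SUM(o.amount) AS revenue
    FROM orders o JOIN stores s ON o.store_id = s.store_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(rows)  # [('APAC', 200.0), ('EMEA', 200.0)]
```

At warehouse scale these joins span billions of rows, which is why OLAP stores use columnar layouts and distributed execution rather than the row-oriented design of an OLTP system.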
Early data warehouses were generally built on existing RDBMS stacks, and the adaptations made to that technology were never sufficient to support the volume, variety, and velocity of the Big Data era. As more companies embraced the Internet and digital transformation, data volumes and types also increased dramatically. Up until the mid- to late 1990s, most of the data being generated by and for companies was structured or semi-structured in nature. With the rise of social media, sharing platforms, and IoT devices, the types of data available became more varied. Data warehouses could only handle structured and semi-structured data, and were not the answer for the growing volume of unstructured data ingested from these new sources. A new method of collecting, storing, and exploring these combined data types was needed.
Cloud to the rescue
Facing the shortcomings of traditional on-premises data warehouses and data lakes, data stakeholders struggle with the challenges of scaling infrastructure, finding critical talent, controlling costs, and ultimately managing the growing expectation to deliver valuable insights. Furthermore, as enterprises become increasingly data-driven, data warehouses and data lakes play a critical role in an organization's digital transformation journey. Today, companies need a 360-degree, real-time view of their business to gain a competitive edge.
In order to stay competitive, companies need a data platform that enables data-driven decision making across the enterprise. But this requires more than technical changes; organizations need to embrace a culture of data sharing. Siloed data is silenced data. To broaden and unify enterprise intelligence, securely sharing data across lines of business is critical.
When users are no longer constrained by the capacity of their infrastructure, value-driven data products are limited only by an enterprise's imagination. Utilizing the cloud supports organizations in their modernization efforts because it minimizes toil and friction by offloading administrative, low-value tasks.
By migrating to the cloud and modernizing these traditional management systems, organizations can:
• Reduce their storage and data processing costs.
• Scale to ingest, store, process, and analyze all relevant data, from both internal sources and external or public ones.
• Accelerate time to value by enabling real-time and predictive analytics.
• Embrace a data culture across the organization and enjoy best-of-breed analytics and machine learning (ML).
• Leverage simple and powerful data security and governance across layers.
• Democratize data, which needs to be easily discoverable and accessible to the right stakeholders inside and outside of the enterprise in a secure manner. The cloud enables accessibility and offers tools so that a business user's skill set does not limit their ability to embed data into their daily work. This may look like simplified reporting tools, cloud-backed spreadsheet interfaces, and drag-and-drop analytic tools.
Data warehouse and data lake convergence
As mentioned previously, some of the key differences between a data lake and a data warehouse relate to the type of data that can be ingested and the ability to land unprocessed (raw) data into a common location. This can happen without the governance, metadata, and data quality that would have been applied in traditional data warehouses.
These core differences explain the changes around the personas using the two platforms:
• Traditional data warehouse users are BI analysts who are closer to the business, focusing on driving insights from data. Data is traditionally prepared by ETL tools based on the requirements of the data analysts. These users traditionally use the data to answer questions.
• Data lake users, in addition to analysts, include data engineers and data scientists. They are closer to the raw data, with the tools and capabilities to explore and mine it. They not only transform the data into business data that can be transferred to the data warehouse, but also experiment with it and use it to train their ML models and for AI processing. These users not only find answers in the data; they also find questions.
As a result, these two systems are often managed by different IT departments with different teams, split between the data warehouse and the data lake. However, this approach has a number of tradeoffs for customers and traditional workloads. The disconnect has an opportunity cost: organizations spend their resources on operational aspects rather than on business insights, and so cannot allocate resources to the key business drivers or to challenges that would let them gain a competitive edge.
Additionally, maintaining two separate systems with the same end goal of providing actionable insights from data can cause data quality and consistency problems. Without alignment on how data is stored and transformed, there may end up being two different values for what is ostensibly one record. Given the extra effort required to transform data into standardized values, such as timestamps, many data users are less inclined to return to the data lake every time they need data. This can lead to data puddles across the enterprise: datasets stored on individuals' machines, which pose both a security risk and an inefficient use of data.
For example, if an online retailer spends all their resources on managing a traditional data warehouse to provide daily reporting that is key to the business, they fall behind on creating business value from the data, such as leveraging AI for predictive intelligence and automated actions. Hence, they lose competitive advantage through increased costs, lower revenues, and higher risk. The alternative is to use fully managed cloud environments, whereby most of the operational challenges are resolved by the services provided.
Cloud computing introduced new deployment methods for large-scale DBMSs in which storage is not co-located with the compute servers. Managing both storage and compute in distributed, elastic clusters with resiliency and security in place still requires administrative overhead to ensure that capacity is available for the converged system. The cloud's global-scale infrastructure and native managed services provide an environment that optimizes the convergence of the data lake and data warehouse, delivering the benefits of both the lake and the warehouse without the overhead of both.
Cloud’s nearly limitless scalability is what enables the convergence of data warehouses and lakes. Data centers full of servers that allocate storage and compute independently enable distributed applications for interactive queries, and the compute required to analyze any amount of data can scale dynamically from warm pools of compute clusters. When storage is decoupled from compute, it can serve many different use cases. The same storage that once held file-based data lakes can now also be the storage for the data warehouse. This key convergence enables data to be stored once, utilizing views to prepare the data for each specific use case.
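The store-once, prepare-per-use-case pattern can be sketched as follows (table, column, and view names are illustrative, and SQLite stands in for shared cloud storage): one set of raw events backs both a cleaned BI-facing view and a raw data-science-facing view, so no copy of the data is needed.

```python
import sqlite3

# Sketch of "store once, use views per use case". A single events
# table (the shared storage) backs two persona-specific views.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (ts TEXT, user_id INTEGER, action TEXT, payload TEXT);
INSERT INTO events VALUES
  ('2024-01-01', 1, 'purchase', '{"sku": "A"}'),
  ('2024-01-01', 2, 'view',     '{"sku": "A"}'),
  ('2024-01-02', 1, 'purchase', '{"sku": "B"}');

-- BI-facing view: filtered and aggregated, ready for reporting.
CREATE VIEW daily_purchases AS
  SELECT ts, COUNT(*) AS purchases
  FROM events
  WHERE action = 'purchase'
  GROUP BY ts
  ORDER BY ts;

-- Data-science-facing view: raw rows, unprocessed payloads included.
CREATE VIEW raw_events AS SELECT * FROM events;
""")

print(conn.execute("SELECT * FROM daily_purchases").fetchall())
# [('2024-01-01', 1), ('2024-01-02', 1)]
print(conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0])  # 3
```

Because both views read the same underlying rows, the BI report and the data-science exploration can never disagree about what the source data says.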
An example of this would be utilizing the same underlying storage for a data warehouse that serves BI reporting as for a Spark cluster. This enables the Spark code that data lake teams spent years perfecting to take advantage of the more performant storage often used as part of a distributed computing system. It allows the compute to move to the data, rather than requiring the data to be shuffled. This unlocks better speed and performance without requiring high-end infrastructure. Many clouds offer this as a managed service, further abstracting away the management of the infrastructure, much like converging the storage of these two systems.
Our customers face common challenges and tradeoffs when they try to build a single monolithic platform:
• IT Challenge: Data sits across multiple storage types and systems — data warehouses, data lakes, and data marts that may be located on-premises, in a single cloud, or across multiple cloud providers. Customers are forced to either distribute their data governance and replicate the overhead of security, metadata management, data lineage, etc., across systems, or copy large amounts of “important” or “sensitive” data into one large system that is more tightly controlled than the rest.
• Analytics Challenge: Analytics tools cannot always access the right data and related artifacts. Organizations usually find themselves having to choose between giving their analytics team free rein or limiting data access, which can in turn hamper analytic agility.
• Business Challenge: Data trustworthiness is a big issue. Business users want to have more data ownership, which would give them more trust in the data, but freer access to data can potentially lower its quality. Organizations need to decide whether to optimize for more access with potentially lower data quality, or to tightly control access in an attempt to maintain high data quality.
These challenges create unintended tension among teams. Every organization wants a platform that provides secure, high-quality data that is accessible to the right data users. What if they don’t have to compromise?