Database architectures have experienced constant innovation, evolving with the appearence of new use cases, technical constraints, and requirements. From the three database structures we are comparing, the first one to appear was the Data Warehouses, introduced in the 80’s with the support of Online Analytical Processing (OLAP) systems, helping organizations face the rise of diverse applications in the 90’s by centralizing and supporting historical data to gain competitive business analytics. Later in the early 2000s Data Lakes appeared, thanks to innovations in cloud computing and storage, enabling to save an exorbitant amounts of data in different formats for future analysis.
To this day both solutions remain popular depending on different business needs. For instance, data warehouses allow for high-performance Business Analytics and fine grained data governance. However, they lack on affordable scalability for petabytes of data. On the other side of the pitch, data lakes enable high throughput and low latency, but they have issues with data governance leading to unmanageable “data swamps”. In addition, data is considered immutable, which leads to additional integration efforts.
This is why we can find modern data lake and data warehouse ecosystems converging, both getting inspiration, borrowing concepts, and addressing use cases from each other. In this landscape we find a new architecture emerge: the Data Lakehouse, which tries to combine the key benefits of both competing architectures, offering low-cost storage accessible by multiple data processing engines such as Apache Spark, raw access to the data, data manipulation, and extra flexibility. Let’s review each one of them in detail.
There are multiple indicators to consider when selecting a database architecture. To be more comprehensive, we pre-selected a set of common concerns.
A data warehouse is a centralized system designed to store present and historical data. Their focus is to provide readily available data for advanced querying and analysis. The main motivation for the emergence of data warehouses relied on solving the inconsistent data of RDBMS by transforming data from operational systems to analytical-processing support systems. The data warehouse is considered as the primary source of truth on business operations.
A standard data warehouse architecture (image above) uses Extract, Transform and Load (ETL) for data transit through three different layers, data staging, data core, and data access. The first layer sees raw data format transition to a fully transformed set. Then data schema-on-write is used to optimize the data model for downstream BI consumption. Finally, the access layer lets users retrieve the translated and organized data through SQL queries.
Data warehouse powers reports, dashboards, and analysis tools by storing data efficiently. They minimize input and output (I/O), allowing query results to be delivered faster and to multiple users simultaneously. Additionally, subsets of the data warehouse, called data marts, can be provided to answer specialized analytical needs.
Let’s review data warehouses by our selected indicators:
As a result, data warehouses excel at providing quality and consistency of the data; they allow for performant data analytics and BI thanks to their ACID compliance.
The limitations of data warehouses start when there is the need to use varied types of data like textual, streaming IoT or multimedia. In addition, the rise of machine learning and AI requires iterative computations that need direct raw data access.
Data lakes first appeared thanks to the inception of Hadoop which allows for multiple machines to work as one, allowing compute and storage to scale horizontally. Additionally, Hadoop allows data to be ‘thrown in’ without worrying about its structure. This works thanks to a schema-on-read process, where data is ingested in its raw format and only transformed when needed for analysis, enabling for semi-structured and unstructured data support. Lastly, data lake architecture relies on a metadata catalog (e.g., Hive Metastore, Apache Glue). Without the metadata catalog, data lakes quickly become unmanageable and drift into data swamps.
Later on, cloud services, AWS being the first one, introduced capabilities to decouple compute and storage, which mean they could scale independently. Moreover, many upgrades were possible thanks to Hadoop ecosystem being open-source. This include big data frameworks (e.g., Hortonworks, Cloudera, mapR) which aid on Hadoop components handling, as well as the addition of new open-source tools like Apache Spark, which started to be used as the processing engine, enabling performant ETL to transform raw data into refined and structured data for different uses cases.
Nevertheless, data lakes have difficulties providing data management, guaranteeing quality, security, and data governance. On top of that, if you need high performant SQL queries on petabytes of data and return complex analytical results fast, or the use of BI tools and features such as schema enforcement data and versioning, then data lakes are not enough. That’s why the current use of data lakes has evolved to a two-tier architecture alongside data warehouses.
In this two-tier architecture, structured data is traditionally stored in raw format in the data lake, but is later on processed and stored in tabular format in the data warehouses, as we can see in the image above. The data stored in data warehouses are used for data analytics and Business Intelligence, while semi-structured and unstructured data is stored in the data lake and used for Data Science and Machine Learning.
Let’s review the two-tier architecture with the preselected indicators:
|BI analysts have limited access to data warehouses in the two-tier architectures, relying on data engineers to structure data into the warehouse. Lastly, data lakes often become data swamps when there is incorrectly cataloged metadata, making that data unmanageable.
|Raw data is accessible through data lakes, but often BI analysts will only have access to the data warehouse where engineers load selected data.
|Structured, semi-structured and unstructured support
|Data lakes perform ELT (Extract, Load, Transform), meaning they don’t transform data before loading it, as they do not implement a schema for the data to fit. Instead, the transformation and schema are verified when a query is required. This is the process previously mentioned as a schema-on-read approach.
|Governance & Security
|All the data is stored and managed as files. This doesn’t provide fine-grained access control on the contents of files, but only coarse-grained access control.
|Upserts & purging
|Any data manipulation language (DML) operation on a data lake results in a file modification. A new file is created, and additional metadata operations must be made to keep track of the changes. Managing and deleting file versions is a complex task in a data lake.
|Thanks to decoupled compute and storage, data lakes can scale independently. Cloud-based data lakes offer both storage and computing, making them quite scalable in comparison to Data Warehouses. Additionally, distributed file systems allow for increased storage capacity. On the negative side, due to the nature of the two-tier architecture, the scalability issues of data warehouses arise. On top of that, data is constantly transformed and processed towards data warehouses, which adds further costs and duplication of the data in both data lakes and data warehouses.
|Data lakes can’t match the performances of data warehouses even with engines like Spark. Additionally, the two-tier architecture is highly complex for users because data goes first into data lakes and then into data warehouses, creating complexities, delays, and new failure modes. This brings considerable performance issues when compared to plain data warehouses.
|Since data lakes lack fine-grained governance and ACID compliance, there are potential issues with the consistency of the data. This is especially an issue when there are multiple readers and writers. There’s also the complexity of schema mismatch due to the lack of schema enforcement in an environment that relies on batch and streaming data from multiple heterogeneous sources. Additionally, in the case of two-tier architecture, continuous engineering is required to ETL data between data warehouses and data lakes. Each transformation incurs risks of failures that reduce data quality. This also results in stale data in the data warehouse, as transformations from data lakes may take days to load.
|ML & DS applications, thanks to open data format (such as parquet and orc) and engines like Spark, are directly accessible to a wide range of other analytics engines, such as machine learning systems. For BI and high performant SQL applications, it is required to do ETL pipelines towards a Data warehouse.
As a result, data lakes bring efficient capabilities to the openness of the data and the cost of data storage. Additionally, they are suitable for machine learning and artificial intelligence algorithms, thanks to their support of various processing frameworks (allowing the use of python libraries) and give access to large amounts of raw data.
On the downside, the two-tier architecture results in complex ETL pipelines due to the significant movement, processing, and duplication of data towards data warehouses. Operationalizing and governance of this data architecture also become a challenge due to cost and complexity. All these results in data swamps and stale data.
In 2019 Databricks released the paper Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores introducing the concept of data lakehouse and Delta Tables. They intended to add a storage layer alongside Apache Spark, allowing for transactions and enforcing schema-on-write in an object-store. On the side, Netflix and Uber had released similar capabilities through Apache Iceberg and Apache Hudi, respectively. This would make the use of data warehouses redundant on top of a data lake.
In its architecture, a data lakehouse aims to provide data governance capabilities to a data lake while also reducing the operational costs of the aforementioned two-tier architecture. To achieve this, two features become vital. The first is the use of open file formats, such as Parquet and ORC, to facilitate essential statistics and allow for predefined data schemas. The second is the low-cost data storage system of a data lake, as decoupled compute and storage will allow for multiple processing engines.
But this doesn’t provide warehouse capabilities such as in-depth data management, versioning, or schema enforcement. ACID transactions are required, which were previously impossible in a distributed system where everything is on object storage.
Lakehouse architecture (image above) embraces this ACID paradigm by leveraging a metadata layer (e.g., Hive metastore, HDFS) and, more precisely, a storage abstraction framework (Apache Iceberg, Apache Hudi, Delta Tables). These open table formats will allow the metadata layer to register changes as transactions while handling concurrency.
Let’s dive into each point of comparison:
This architecture allows key warehouse capabilities to exist on a data lake. Furthermore, lakehouse solutions implement other optimizations on the engine layer (through Spark or Flink) to optimize query performance, such as caching, auxiliary data structures (indexes and statistics), and data layout optimizations. In comparison to data lakes, they reduced data redundancy and data staleness due to a single all-purpose data storage, reduced the so-called data swamps as data is now versioned, and added governance and security layers on top of it.
On the negative side, data lakehouse architecture is relatively new and immature, and some added features are still on a to-do list. There are also complaints on different topics, like Delta Lake’s reliance on querying uniquely through Delta Lake tables and not external tables or the complexities of using notebooks compared to the simple interface of modern Data warehouses.
The convergence of data warehouses and lakes towards each other has brought us the new lakehouse architecture, but let’s summarize how each one stacks against the data lakehouse:
- If you need high-performance Business Analytics while having access to fine-grained data governance, data warehouses are your choice. The high performance of a warehouse is unmatched by the others. Nevertheless, they are challenging and expensive to scale and lack the flexibility to efficiently handle all data types. If high transaction throughput and different data types are a requirement, lakehouse architecture can be a solution. The main argument against this transition is the complex migration of such different architectures.
- If you want to scale and process petabytes of data affordably, having storage while preserving computing and provide high throughput (in opposition to low latency access), data lakes are your choice. On the other hand, data lakes do not provide fine-grained access control or high-performance business analytics. In case you need this, a transition to data lakehouse might be possible and easier to perform due to these architecture reliance on similar distributed technology.
We also want to briefly mention modern cloud data warehouses (such as Snowflakes, Clickhouse, Azure Synapse) as they provide comparable solutions to lakehouses. Nevertheless, they are slightly different, as they behave more as a data warehouse trying to adopt data lake properties like decoupled compute and storage.